BAYESIAN-MOTIVATED TESTS OF FUNCTION FIT AND THEIR
ASYMPTOTIC FREQUENTIST PROPERTIES

By Marc Aerts, Gerda Claeskens and Jeffrey D. Hart

Limburgs Universitair Centrum, K. U. Leuven and Texas A&M University 

We propose and analyze nonparametric tests of the null hypothesis
that a function belongs to a specified parametric family. The tests
are based on BIC approximations, $\pi_{\mathrm{BIC}}$, to the posterior
probability of the null model, and may be carried out in either Bayesian or
frequentist fashion. We obtain results on the asymptotic distribution
of $\pi_{\mathrm{BIC}}$ under both the null hypothesis and local alternatives.
One version of $\pi_{\mathrm{BIC}}$, call it $\pi^*_{\mathrm{BIC}}$, uses a
class of models that are orthogonal to each other and growing in number
without bound as sample size, $n$, tends to infinity. We show that
$\sqrt{n}(1 - \pi^*_{\mathrm{BIC}})$ converges in distribution to a stable law
under the null hypothesis. We also show that $\pi^*_{\mathrm{BIC}}$ can detect
local alternatives converging to the null at the rate $\sqrt{\log n / n}$.
A particularly interesting finding is that the power of the
$\pi^*_{\mathrm{BIC}}$-based test is asymptotically equal to that of a test
based on the maximum of alternative log-likelihoods.

Simulation results and an example involving variable star data 
illustrate desirable features of the proposed tests. 

1. Introduction. Consider a model in which the observed data vector
$Y$ has distribution $f(y, g, \eta)$, where $f$ is known, $g$ is an unknown
function and $\eta$ is a vector of unknown nuisance parameters. We wish to
test the null hypothesis that $g$ is in a specified parametric family
$\mathcal{G} = \{g(\cdot;\theta) : \theta \in \Theta\}$ against the
nonparametric alternative that $g \notin \mathcal{G}$. This paper proposes a
Bayes-inspired test of such a hypothesis. A version of the test was proposed
by Hart (1997) in the special case of checking the fit of a parametric
regression model. The idea is simple. Consider a sequence of models for $g$ of
varying dimensions, one of which is the parametric (or null) model whose
fit is to be tested. The posterior probability, $\pi_n$, of the null model is
computed, and if this probability is sufficiently low, the null model is
rejected. This test may be carried out in either Bayesian or frequentist
fashion. One may determine a sequence of constants $a_n$ such that
$a_n(1 - \pi_n)$ converges in distribution to a nondegenerate random variable
when $H_0$ is true and the sample size $n$ tends to $\infty$. This allows the
frequentist to conduct a valid large sample test of given size based on
$a_n(1 - \pi_n)$. On the other hand, a Bayesian may simply wish to make a
decision based on the value of $\pi_n$, irrespective of an a priori type I
error probability.

The idea of using a Bayesian-motivated statistic in frequentist fashion is 
not new. Good (1957) proposed that the distribution of a Bayes factor be 
computed on the assumption that a sharp null hypothesis is true, and P- 
values corresponding to the Bayes factor be used as a significance criterion. 
Good (1992) gives an extensive review of compromises between Bayesian 
and non-Bayesian methodologies. 

Lack-of-fit and goodness-of-fit tests based on orthogonal series expansions 
and/or smoothing ideas have received considerable attention in the last fif- 
teen or so years. Many references to this work may be found in the book 
of Hart (1997). Seminal references on series-based goodness-of-fit tests, that 
is, so-called smooth tests, are Neyman (1937) and Rayner and Best (1989, 
1990). More recently, Ledwina (1994) and Fan (1996) have proposed adap- 
tive versions of Neyman's smooth test. Eubank and Hart (1992) and Aerts, 
Claeskens and Hart (1999) have studied the so-called order selection test 
in the contexts of regression and general likelihood models, respectively. A 
nonparametric Bayesian goodness-of-fit test has been proposed by Verdinelli 
and Wasserman (1998). 

The rest of the paper is organized as follows. Section 2 considers frequen- 
tist and formal Bayesian versions of the proposed test, and discusses the 
choice of alternative models and specification of priors. Section 3.1 summa- 
rizes a simulation study comparing the power of our test with other omnibus 
lack-of-fit tests. In Section 3.2 our methods are applied to the problem of 
testing for a trend in the sequence of times between maximum brightnesses 
of the long-period variable star Omicron Ceti or Mira. Section 4 presents our 
theoretical results on the asymptotic frequentist properties of the proposed 
tests. Finally, the Appendix contains mathematical details and proofs of the 
theorems. 

2. Test procedures. To reiterate, we assume that observed data $Y$ have
distribution $f(y, g, \eta)$ for some function $g$ and vector of parameters
$\eta$. We wish to test the hypothesis, call it $H_0$, that the function $g$
lies in the parametric family of functions $\mathcal{G}$. The model which
assumes that $H_0$ is true will be called $M_0$. We consider a collection of
alternative models denoted $M_1, \ldots, M_K$, where each corresponds to a
different parametric specification for the function $g$. These models need not
be nested within each other. Since we wish our test of $H_0$ to be
nonparametric, $K$ should be fairly "large" and the union of
$M_0, M_1, \ldots, M_K$ should come close to spanning the space of all
possibilities for $g$. Indeed, we can envision $K$ growing with the number
of observations in $Y$ in such a way that, asymptotically, the models under
consideration do span all the possibilities.

Our tests of $H_0$ are based on a posterior probability for $M_0$ or on an
approximation to that probability. These tests run the gamut from a purely
Bayesian approach based on informative priors to a purely frequentist one
that involves no prior specification at all. In any case, our tests take the
form

"reject $H_0$ when $\pi_n = P(M_0|y)$ is sufficiently small."

A Bayesian will make a decision, or perhaps abstain from doing so, by simply
examining $\pi_n$ and/or a Bayes factor. On the other hand, a frequentist will
wish to determine the sampling distribution of $P(M_0|Y)$ on the assumption
that $H_0$ is true, and then reject $H_0$ at level of significance $\alpha$ if
and only if $\pi_n$ is smaller than an $\alpha$ quantile of this distribution.
The frequentist may well regard $\pi_n$ differently than a Bayesian. The latter
views $\pi_n$ as the probability that $H_0$ is true in light of the observed
data, whereas the former may simply view it as a statistic that contains
evidence about the hypotheses of interest.

In Section 2.1 we turn to the question of choosing alternative models
$M_1, M_2, \ldots$, a question of relevance to both Bayesians and frequentists.
Section 2.2 considers a formal Bayesian version of the proposed test, including
a discussion of noninformative priors for the models $M_0, M_1, \ldots$. An
asymptotic version of the test requiring no specification of priors is
introduced in Section 2.3.

2.1. Alternative models. We shall consider two main types of alterna- 
tive models: those which are guaranteed to contain the true function g (at 
least in a limiting sense) and those which do not necessarily contain g but 
nonetheless lead to a consistent test for virtually any g. An example will be 
helpful to illustrate these two types. In the sequel, the model for $g$
corresponding to probability model $M_j$ will be denoted $g_j$. Suppose that
$g$ and each member of $\mathcal{G}$ are continuous functions over the interval
$[0, 1]$, which means that $g$ can be written as

$$g(x) = g(x;\theta) + \delta(x),$$

where $\delta$ has the Fourier series representation

$$\delta(x) = \sum_{j=0}^{\infty} \alpha_j \cos(\pi j x), \qquad 0 \le x \le 1,$$

for constants $\alpha_0, \alpha_1, \ldots$. This representation for $g$
suggests that we take

$$g_j(x;\theta,\boldsymbol{\alpha}_j) = g(x;\theta) + \sum_{k=0}^{j} \alpha_k \cos(\pi k x), \tag{1}$$

where $\boldsymbol{\alpha}_j = (\alpha_0, \ldots, \alpha_j)^T$. Of course, this
model could be modified to suit a given situation. For example, if $g$ is a
regression function and the model $\mathcal{G}$ contains an intercept, then the
constant term $\alpha_0$ should be eliminated from $g_j$. Another model that
would be useful for cases where $g$ is inherently positive is

$$g_j(x;\theta,\boldsymbol{\alpha}_j) = g(x;\theta) \exp\Big\{\sum_{k=0}^{j} \alpha_k \cos(\pi k x)\Big\}.$$

Other basis functions can be used as well; popular examples include wavelets 
and orthogonal Legendre or Hermite polynomials. Wavelet packets would be 
particularly attractive when the most parsimonious basis is unknown to the 
investigator. 
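
As an illustration only, the additive and multiplicative cosine-series alternatives above can be sketched in a few lines of Python. The function names (`cosine_design`, `additive_model`, `multiplicative_model`) and the use of a precomputed vector `null_fit` standing in for $g(x;\theta)$ are assumptions made for this sketch; they are not part of the paper.

```python
import numpy as np

def cosine_design(x, j, include_constant=True):
    """Matrix with columns cos(pi*k*x), k = 0..j (or 1..j without the constant)."""
    x = np.asarray(x, dtype=float)
    start = 0 if include_constant else 1
    ks = np.arange(start, j + 1)
    return np.cos(np.pi * np.outer(x, ks))

def additive_model(x, null_fit, coef, include_constant=True):
    """g(x; theta) + sum_k alpha_k cos(pi k x); null_fit holds g(x; theta)."""
    coef = np.asarray(coef, dtype=float)
    j = len(coef) - 1 if include_constant else len(coef)
    return np.asarray(null_fit) + cosine_design(x, j, include_constant) @ coef

def multiplicative_model(x, null_fit, coef):
    """g(x; theta) * exp{sum_k alpha_k cos(pi k x)}; positive when null_fit is."""
    coef = np.asarray(coef, dtype=float)
    return np.asarray(null_fit) * np.exp(cosine_design(x, len(coef) - 1) @ coef)

# Hypothetical usage: a constant null fit perturbed by two cosine terms.
x = np.linspace(0.0, 1.0, 5)
g_add = additive_model(x, np.full(5, 2.0), [0.0, 0.3, -0.1])
g_mul = multiplicative_model(x, np.full(5, 2.0), [0.0, 0.3, -0.1])
```

Setting `include_constant=False` corresponds to dropping the constant term when the null family already contains an intercept.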

As $j \to \infty$, functions of the form (1) span the space of all functions
that are continuous on $[0, 1]$. In many settings this property is enough to
ensure that there exist tests based on the models $M_1, \ldots, M_K$ that are
consistent against any continuous alternative to $H_0$, so long as $K$ tends to
$\infty$ at an appropriate rate with the sample size. An example of such a test
is given in Aerts, Claeskens and Hart (1999).

On the other hand, it is possible to construct consistent tests based on
sequences of models that do not contain, even in the limit, the true function
$g$. Such sequences can have certain advantages when using the methodology
proposed in this paper. For our tests to be consistent, it is usually enough
that the best approximation to $g$ among the models entertained is not in
$\mathcal{G}$. Again suppose that $g$ is a function defined on $[0, 1]$. Two
candidates for $g_j$ are

$$g(x;\theta) + \alpha_j \cos(\pi j x) \quad\text{and}\quad g(x;\theta) \exp[\alpha_j \cos(\pi j x)]. \tag{2}$$

Now, if $g$ is not in $\mathcal{G}$, but is continuous, then, generally
speaking, there will exist a $k$ such that the MLE of $\alpha_k$ in
$g(x;\theta) + \alpha_k \cos(\pi k x)$ consistently estimates a nonzero
quantity [White (1994)]. Such a property implies the existence of a
consistent test.

The alternative models considered could be more or less arbitrary. For
example, in the situation discussed immediately above we could entertain
all models of the form

$$g(x;\theta) + \sum_{k \in \mathcal{K}} \alpha_k \cos(\pi k x),$$

where $\mathcal{K}$ is a subset of $\{0, 1, \ldots, K\}$ for some $K$. If $K$
grows with sample size, such alternatives are problematic in that the number
of models that must be fitted is $2^{K+1}$, which becomes prohibitively large
very quickly.

In the sequel we will mainly be concerned with two classes of alternative
models, ones that are nested and ones we shall call singletons that contain
only one more parameter than $M_0$. Nested models are such that $M_j$ is a
special case of $M_{j+1}$ for $j = 0, 1, \ldots$, while singletons contain
$M_0$ but are not nested within each other.

2.2. Formal Bayes tests. Corresponding to model $M_j$, $j = 0, 1, \ldots, K$,
are the nuisance parameters $\eta$, parameters $\theta$ and
$\boldsymbol{\alpha}_j$ that specify $g$, and the dimension of
$(\theta, \boldsymbol{\alpha}_j, \eta)$, denoted $m_j$. The likelihood function
for $M_j$ is $L(\theta, \boldsymbol{\alpha}_j, \eta)$. Let $p_j$ be the prior
probability of the $j$th model, and $\pi_j(\theta, \boldsymbol{\alpha}_j, \eta)$
the conditional prior density of $(\theta, \boldsymbol{\alpha}_j, \eta)$ given
that the true model is $M_j$. The posterior probability of $M_0$, that is, the
basis of our test of $H_0$, is

$$P(M_0|y) = \frac{p_0 \int L(\theta_0, \eta)\, \pi_0(\theta_0, \eta)\, d\theta_0\, d\eta}{\sum_{j=0}^{K} p_j \int L(\theta, \boldsymbol{\alpha}_j, \eta)\, \pi_j(\theta, \boldsymbol{\alpha}_j, \eta)\, d\theta\, d\boldsymbol{\alpha}_j\, d\eta} = \Big\{1 + \sum_{j=1}^{K} \frac{p_j}{p_0} B_j\Big\}^{-1},$$

where $B_j$ is the Bayes factor of $M_j$ to $M_0$, that is,

$$B_j = \frac{\int L(\theta, \boldsymbol{\alpha}_j, \eta)\, \pi_j(\theta, \boldsymbol{\alpha}_j, \eta)\, d\theta\, d\boldsymbol{\alpha}_j\, d\eta}{\int L(\theta_0, \eta)\, \pi_0(\theta_0, \eta)\, d\theta_0\, d\eta}.$$

In a subjective Bayesian analysis, the prior probabilities $p_j$ and prior
distributions $\pi_j$, $j = 0, 1, \ldots, K$, are chosen to represent the
investigator's degree of belief in the various models and the parameters
therein. A Bayesian who wishes to do an analysis independent of his or her own
prior beliefs may wish to use noninformative priors. In our setting, it is
necessary to formulate such priors for the parameters in each of the models
$M_0, \ldots, M_K$ and also to specify "vague" prior probabilities over these
models. We have little to say here about the former problem since much has
already been written about it. There has been much debate about what is the
most appropriate noninformative or reference prior in a given situation, and,
indeed, about whether or not any prior can truly express ignorance about the
underlying parameters. Rather than entering this debate, we refer the
interested reader to the excellent review article of Kass and Wasserman (1996)
for further discussion of the problem and many relevant references.

We turn now to the question of assigning vague prior probabilities to the
models $M_0, \ldots, M_K$. One possibility is to simply give each model the
same probability of $1/(K + 1)$. Inasmuch as $H_0$ has some special
significance (scientific or otherwise), there may be a prevailing a priori
degree of belief in it, expressed by $p_0 = \pi$. In this case we could take
$p_j = (1 - \pi)/K$, $j = 1, \ldots, K$, to express lack of preference for any
alternative model.

For some choices of alternative models it is debatable whether assigning
them equal probabilities is really noninformative. When the models are
nested with $m_0 < m_1 < \cdots$, one could argue that it is natural to put
smaller prior probabilities on the models of larger dimension. Jeffreys (1961)
proposed using the improper prior $p_j = 1/(j + 1)$, $j = 0, 1, \ldots$, for
such problems. A proper noninformative prior for the positive integers was
proposed by Rissanen (1983).

Sometimes one may consider more than one model having a given dimension.
If the distinct model dimensions are $m_0 < m_1 < \cdots$, then we may assign
prior probability of $2^{-(j+1)}$ to the collection of models having dimension
$m_j$ and equal probability to each individual model of that dimension. Such
a scheme has been proposed by Berger and Pericchi (1996).
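
The model-prior schemes just described are easy to tabulate. The following is a minimal sketch, with hypothetical function names, of the uniform prior, the prior that fixes $p_0 = \pi$ and spreads the rest evenly, and a Jeffreys-type prior $p_j \propto 1/(j+1)$. Since the latter is improper on $j = 0, 1, \ldots$, the sketch assumes it is truncated at $K$ and renormalized.

```python
import numpy as np

def uniform_model_prior(K):
    """Equal probability 1/(K+1) on M_0, ..., M_K."""
    return np.full(K + 1, 1.0 / (K + 1))

def null_weighted_prior(K, pi0):
    """p_0 = pi0 and p_j = (1 - pi0)/K for j = 1, ..., K."""
    p = np.full(K + 1, (1.0 - pi0) / K)
    p[0] = pi0
    return p

def truncated_jeffreys_prior(K):
    """p_j proportional to 1/(j+1), j = 0..K, renormalized (truncation assumed)."""
    w = 1.0 / (np.arange(K + 1) + 1.0)
    return w / w.sum()

# Hypothetical usage with K = 15 alternative models.
p_unif = uniform_model_prior(15)
p_null = null_weighted_prior(15, 0.5)
p_jeff = truncated_jeffreys_prior(15)
```

With $K = 15$ the normalizing constant of the truncated prior is the harmonic sum $\sum_{j=0}^{15} 1/(j+1) \approx 3.381$.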

It is of some interest to know what form $\pi_n$ takes in various cases. Hart
(1997) obtains an explicit expression for a very accurate approximation to
$\pi_n$ in a regression context where one tests the hypothesis that the
regression function is flat. In most cases, though, it will not be possible to
write this probability as an explicit function of the data. Numerical
integration or use of MCMC methods will then be needed to compute $\pi_n$.

2.3. Tests free of prior specification. Let $m(y)$ be the marginal
distribution of the data $Y$. In deriving the well-known BIC for selecting
model dimension, Schwarz (1978) showed that in exponential family models

$$\log P(M_j|y) \approx \log L_j - \tfrac{1}{2} m_j \log n - \log m(y) = \mathrm{BIC}_j - \log m(y),$$

where $n$ denotes the dimension of $y$, $m_j$ is model dimension and $L_j$ is
the likelihood function for model $j$ evaluated at the MLE. Applying this
approximation to our test statistic $P(M_0|y)$ yields

$$P(M_0|y) \approx \frac{\exp(\mathrm{BIC}_0)}{\sum_{j=0}^{K} \exp(\mathrm{BIC}_j)} \equiv \pi_{\mathrm{BIC}}.$$

Perhaps the most interesting aspect of this approximation, especially for a
frequentist, is that it is completely free of prior probabilities. The
statistic $\pi_{\mathrm{BIC}}$ would seem to be attractive to frequentists and
Bayesians alike. The frequentist will appreciate the fact that
$\pi_{\mathrm{BIC}}$ requires no specification of priors and is thus
immediately usable as a test of $H_0$ versus general alternatives. For a
Bayesian, $\pi_{\mathrm{BIC}}$ can serve as a rough and ready approximation to
the posterior probability of $M_0$ when the sample size is large, a property
established in various contexts by Schwarz (1978), Haughton (1988), Kass and
Raftery (1995) and Kass and Wasserman (1995). The reader is cautioned,
however, that $\pi_{\mathrm{BIC}}$ will not always be an adequate
approximation. This is especially true in small to moderate sample sizes.
Furthermore, the approximation can be poor depending on the type of prior
distribution used for the parameters of the models $M_0, M_1, \ldots, M_K$.
For more on this last point, the reader is referred to Kass and Wasserman
(1995).
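
To make the BIC approximation concrete, the sketch below computes $\exp(\mathrm{BIC}_0)/\sum_j \exp(\mathrm{BIC}_j)$ with $\mathrm{BIC}_j = \log L_j - \tfrac12 m_j \log n$ from a list of maximized log-likelihoods and model dimensions. The function name `pi_bic` and its argument layout (index 0 is the null model) are assumptions of this sketch; subtracting the largest BIC before exponentiating is a standard numerical safeguard, not something taken from the paper.

```python
import numpy as np

def pi_bic(loglik, dims, n):
    """BIC approximation to the posterior probability of the null model.

    loglik[j] : maximized log-likelihood of model M_j (index 0 is the null)
    dims[j]   : number of parameters m_j of model M_j
    n         : sample size
    """
    loglik = np.asarray(loglik, dtype=float)
    dims = np.asarray(dims, dtype=float)
    bic = loglik - 0.5 * dims * np.log(n)
    rel = bic - bic.max()          # shift for numerical stability
    weights = np.exp(rel)
    return float(weights[0] / weights.sum())

# Hypothetical usage: one alternative with one extra parameter and no gain in fit.
value = pi_bic([0.0, 0.0], [1, 2], n=100)
```

In this toy case the alternative is penalized by a factor $n^{-1/2} = 0.1$, so the null receives weight $1/1.1$.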

2.4. A frequentist test. Let $\mathcal{A} = \{M_1, \ldots, M_K\}$ be a
collection of models, each of which contains the null model $M_0$ as a special
case. We consider the test that rejects the null hypothesis for large values
of $1 - \pi_{\mathrm{BIC}}$, where

$$\pi_{\mathrm{BIC}} = \Big\{1 + \sum_{j=1}^{K} n^{-(1/2)(m_j - m_0)} \exp(\mathcal{L}_j/2)\Big\}^{-1},$$

$\mathcal{L}_j$ is the log-likelihood ratio $2\log(L_j/L_0)$, and $m_j$ denotes
the number of parameters in $M_j$, $j = 0, \ldots, K$. Some of the theory to be
developed later assumes that the model is of generalized linear form. In this
case, the observed data are $(x_1, Y_1), \ldots, (x_n, Y_n)$, where each $x_i$
is a vector of covariates and each $Y_i$ a scalar response. Assuming the
covariates to be fixed and the observations to be independent, the
log-likelihood function has the form

$$\ell(g, \eta) = \sum_{i=1}^{n} \big\{[Y_i g(x_i) - b(g(x_i))]/a(\eta) + c(Y_i, \eta)\big\},$$

where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are known functions, $g$ is an
unknown function and $\eta$ an unknown dispersion parameter; see, for example,
McCullagh and Nelder (1989). We consider testing the null hypothesis

$$H_0: g(x) = \sum_{j=1}^{p} \theta_j \gamma_j(x) = g(x;\theta), \tag{3}$$

where $\gamma_1, \ldots, \gamma_p$ are known functions and
$\theta = (\theta_1, \ldots, \theta_p)^T$ an unknown parameter vector. The
asymptotic maximizer of the expected log-likelihood

$$\frac{1}{n} \sum_{i=1}^{n} \big[b'(g(x_i))\, g(x_i;\theta) - b(g(x_i;\theta))\big]$$

with respect to $\theta$ is denoted
$\theta_0 = (\theta_{10}, \ldots, \theta_{p0})^T$, which is the true parameter
vector when $H_0$ is true and provides a best null approximation to $g$ when
$H_0$ is false.

Our most general alternatives $M_j$, $j = 1, 2, \ldots$, are of the form

$$g_j(x) = \sum_{i=1}^{p} \theta_i \gamma_i(x) + \sum_{i=1}^{j} \alpha_i v_i(x)$$

for appropriate functions $v_1, v_2, \ldots$. To produce test statistics that
are meaningful and powerful, we insist that the $v_j$'s be orthonormal in the
following sense:

$$\sum_{i=1}^{n} v_j(x_i)\, \gamma_k(x_i)\, b''(g(x_i;\theta_0)) = 0, \qquad j = 1, 2, \ldots;\ k = 1, \ldots, p, \tag{4}$$

and for all $j, k \ge 1$,

$$\frac{1}{n} \sum_{i=1}^{n} b''(g(x_i;\theta_0))\, v_j(x_i)\, v_k(x_i) = \begin{cases} 1, & j = k, \\ 0, & j \ne k. \end{cases} \tag{5}$$

In practice, we may achieve an approximation to (4) and (5) by proceeding
as follows. First, let $(\hat\theta_0, \hat\eta_0)$ be the maximizer of the
null likelihood function. We assume that $\hat\theta_0$ converges in
probability to $\theta_0$. Now, choose a set of functions $u_1, u_2, \ldots$
that is a basis for all functions of interest. Then use a Gram-Schmidt
procedure to construct $\hat v_1, \ldots, \hat v_{n-p}$ that are linear
combinations of $\gamma_1, \ldots, \gamma_p, u_1, \ldots, u_{n-p}$ satisfying
(4) and (5) with $\theta_0$ and $v_j$'s replaced by $\hat\theta_0$ and
$\hat v_j$'s, respectively.
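
One way to carry out such a weighted Gram-Schmidt step numerically is through a QR factorization of the weight-scaled design, which is mathematically equivalent to Gram-Schmidt but more stable. The sketch below is an illustration under stated assumptions, not the authors' implementation: `Gamma` holds the null functions $\gamma_k(x_i)$ as columns, `U` holds the chosen basis functions $u_k(x_i)$, and `w` holds the weights $b''(g(x_i;\hat\theta_0))$, assumed strictly positive. The combined matrix is assumed to have full column rank.

```python
import numpy as np

def weighted_orthonormal_basis(Gamma, U, w):
    """Construct v_1, ..., v_J satisfying sample versions of (4) and (5).

    Returns V (n x J) with  V' W Gamma = 0  and  V' W V / n = I,
    where W = diag(w). Each column of V is a linear combination of the
    columns of Gamma and U.
    """
    Gamma = np.asarray(Gamma, dtype=float)
    U = np.asarray(U, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = Gamma.shape
    root_w = np.sqrt(w)[:, None]
    A = root_w * np.hstack([Gamma, U])   # scale rows by sqrt(weights)
    Q, _ = np.linalg.qr(A)               # reduced QR: orthonormal columns
    return np.sqrt(n) * Q[:, p:] / root_w

# Hypothetical usage: linear null model, cosine basis, arbitrary positive weights.
rng = np.random.default_rng(1)
n = 50
x = (np.arange(1, n + 1) - 0.5) / n
Gamma = np.column_stack([np.ones(n), x])
U = np.column_stack([np.cos(np.pi * k * x) for k in range(1, 5)])
w = rng.uniform(0.5, 2.0, size=n)
V = weighted_orthonormal_basis(Gamma, U, w)
```

For a normal model with identity link the weights are constant, and the construction reduces to ordinary orthonormalization of the basis against the null design.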

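For generalized linear models, the likelihood ratio statistic $\mathcal{L}_j$
can be explicitly obtained as

$$\mathcal{L}_j = 2 \sum_{i=1}^{n} \big[Y_i(\hat\lambda_{ij} - \hat\lambda_{i0}) - \{b(\hat\lambda_{ij}) - b(\hat\lambda_{i0})\}\big],$$

where, for $j = 0, \ldots, K$,

$$\hat\lambda_{ij} = \frac{g_j(x_i; \hat\theta(M_j), \hat{\boldsymbol{\alpha}}_j)}{a\{\hat\eta(M_j)\}}.$$

Note that the maximum likelihood estimators $\hat\theta(M_j)$,
$\hat\eta(M_j)$ and $\hat{\boldsymbol{\alpha}}_j$ depend on the model used.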

3. Numerical results. The applicability of the proposed tests is illus- 
trated by a simulation study in a simple regression setting in Section 3.1 
and by an example involving variable star data in Section 3.2. S-Plus is used 
for calculations. 

3.1. Simulations. We consider normal response data 

$$Y_i \sim N(\gamma(x_i), \eta), \tag{6}$$

where $x_i = (i - 1/2)/n$, $i = 1, \ldots, n$. The mean $\gamma(\cdot)$ is the
parameter of interest and $\eta$ is the unknown variance parameter. In all
settings the sample size was $n = 100$ and $\eta = 0.1$. We focus on testing
for no effect, that is, $\gamma(x) = \theta$. For the alternative models $M_j$
we take

$$\gamma_j(x) = \theta + \sum_{k \in \mathcal{K}_j} \phi_k u_k(x), \qquad j = 1, \ldots, K,$$

with $\mathcal{K}_j$ a subset of $\{1, \ldots, K\}$ and
$u_k(\cdot) = p_k(\cdot)$, the normalized Legendre polynomials on the interval
$[1/(2n), 1 - 1/(2n)]$, $k = 1, \ldots, K$. To examine the influence of the
choice of $K$, all simulations were repeated for $K = 10$ and $K = 20$.
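
The simulation design can be sketched as follows. This is an illustration under assumptions that the text does not spell out: the Legendre polynomials are rescaled to the stated interval and normalized so that their mean square over the interval is one, the data are generated under the null with $\theta = 0$, and the likelihood ratio statistic for a Gaussian model with unknown variance is computed as $n \log(\mathrm{RSS}_0/\mathrm{RSS}_j)$ for the nested models $\mathcal{K}_j = \{1, \ldots, j\}$. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_basis(x, K, a, b):
    """Columns p_1, ..., p_K: Legendre polynomials rescaled from [a, b] to [-1, 1]."""
    t = 2.0 * (x - a) / (b - a) - 1.0
    cols = [np.sqrt(2 * k + 1) * Legendre.basis(k)(t) for k in range(1, K + 1)]
    return np.column_stack(cols)

def nested_lr_statistics(y, U):
    """2*log(L_j/L_0) for nested Gaussian models with unknown variance."""
    n = len(y)
    rss0 = np.sum((y - y.mean()) ** 2)          # null model: constant mean
    stats = []
    for j in range(1, U.shape[1] + 1):
        X = np.column_stack([np.ones(n), U[:, :j]])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        rss_j = np.sum((y - X @ beta) ** 2)
        stats.append(n * np.log(rss0 / rss_j))
    return np.array(stats)

# One simulated data set under the null hypothesis (theta = 0 assumed).
rng = np.random.default_rng(0)
n, K, eta = 100, 10, 0.1
x = (np.arange(1, n + 1) - 0.5) / n
U = legendre_basis(x, K, 1 / (2 * n), 1 - 1 / (2 * n))
y = rng.normal(loc=0.0, scale=np.sqrt(eta), size=n)
lr = nested_lr_statistics(y, U)
```

Because the models are nested, the residual sum of squares cannot increase with $j$, so the statistics are nondecreasing in $j$.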

Define $\mathrm{AIC}_j = \log L_j - m_j$ [Akaike (1974)],

$$\hat r_a = \arg\max_{0 \le j \le K} \mathrm{AIC}_j \quad\text{and}\quad \hat r_b = \arg\max_{0 \le j \le K} \mathrm{BIC}_j.$$

We compare the singleton ($B_S$) and nested ($B_N$) versions of the
Bayes-motivated statistic $\pi_{\mathrm{BIC}}$ (Theorem 1) with some other
nonparametric omnibus tests:
the tests

$$L_a = \mathcal{L}_{\hat r_a} \quad\text{and}\quad L_b = \mathcal{L}_{\hat r_b},$$

the "max-test" based on

$$M_S = \max_{0 \le j \le K} \mathcal{L}_j - 2\log K + \log\log K + \log \pi,$$

and, finally, the adaptive Neyman test $N_a$, which is based on the squared
discrete Fourier transform of the residual vector from the fitted null model
[Fan and Huang (2001), Section 2.1]. We also included two parametric
likelihood ratio tests comparing the null model $M_0$ with the true (unknown)
alternative model $M_j$ ("Oracle" test) and with the "full" model (FM) $M_K$
with $\mathcal{K}_K = \{1, \ldots, K\}$.

The tests $L_a$, $L_b$ and $B_N$ are all based on a sequence of nested
alternative models with $\mathcal{K}_j = \{1, \ldots, j\}$,
$j = 1, \ldots, K$. Score versions of $L_a$ and $L_b$ were studied in Aerts,
Claeskens and Hart (2000), who established that in the present scenario $L_a$
converges in distribution to $W_{\tilde r}$ and $L_b$ to $W_1$, where
$W_r = V_1 + \cdots + V_r$ for $r = 1, 2, \ldots, K$, $W_0 = 0$,
$V_1, V_2, \ldots, V_K$ is a sequence of independent $\chi^2_1$ random
variables and $\tilde r$ is the value of $r$ that maximizes $W_r - 2r$ over
$r = 0, 1, \ldots, K$. The tests $B_S$ and $M_S$ apply singleton alternative
models with $\mathcal{K}_j = \{j\}$, $j = 1, \ldots, K$. The test $M_S$ is
expected to have power characteristics similar to those of $B_S$ (Theorem 5).
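
The limiting variable $W_{\tilde r}$ has no simple closed form, but it is easy to simulate. The following sketch draws independent $\chi^2_1$ variables, forms the partial sums $W_r$, and picks the index maximizing $W_r - cr$. The function name and the `include_zero` switch (whether $r = 0$ with $W_0 = 0$ is allowed) are assumptions of this sketch; `penalty=2.0` corresponds to the AIC-type limit described above.

```python
import numpy as np

def simulate_selected_partial_sum(K, penalty, n_rep, rng, include_zero=True):
    """Draws of W_r at the r maximizing W_r - penalty * r, with V_i iid chi^2_1."""
    V = rng.chisquare(1, size=(n_rep, K))
    W = np.cumsum(V, axis=1)                     # W_1, ..., W_K
    r = np.arange(1, K + 1)
    if include_zero:
        W = np.column_stack([np.zeros(n_rep), W])  # prepend W_0 = 0
        r = np.arange(0, K + 1)
    best = np.argmax(W - penalty * r, axis=1)
    return W[np.arange(n_rep), best]

# Hypothetical usage: Monte Carlo upper quantiles for K = 10 and penalty 2.
rng = np.random.default_rng(2024)
draws = simulate_selected_partial_sum(K=10, penalty=2.0, n_rep=20000, rng=rng)
crit = np.quantile(draws, [0.90, 0.95, 0.99])
```

A sizeable fraction of the draws equals zero, because with penalty 2 the null order $r = 0$ is selected with probability well above one half.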

The definition of $\pi_{\mathrm{BIC}}$ suggests that the distributions of

$$\frac{\sum_{j=1}^{K} \exp(V_j/2)}{1 + n^{-1/2} \sum_{j=1}^{K} \exp(V_j/2)}$$

and

$$1 + \sum_{j=1}^{K} n^{-j/2} \exp\Big(\sum_{i=1}^{j} V_i/2\Big)$$

be used as finite sample corrected approximations to the distributions of
$B_S$ and $B_N$, respectively. For $L_b$, which converges in distribution to a
$\chi^2_1$ random variable, we include a corrected distribution defined as
that of $W_{\tilde r}$, where $W_r$ is as before and $\tilde r$ is the value
of $r$ that maximizes $W_r - r \log n$ over $r = 1, \ldots, K$.

From a simulation based on 30,000 replications, we obtained critical points
(levels $\alpha = 0.01, 0.05, 0.10$) of the large sample distribution of each
test statistic, except for $M_S$ and $N_a$, which are asymptotically
distributed with distribution function $\exp(-\exp(-x/2))$ (Theorem 5) and
$\exp(-\exp(-x))$ [Fan and Huang (2001), Theorem 1], respectively. The
critical values are shown in Table 1. Although the limiting distributions of
$B_N$ and $L_b$ depend on $K$, the simulations produced critical points that
were identical for both values of $K$.
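
Critical points of this kind can be approximated by direct Monte Carlo. The sketch below, with hypothetical function names, simulates the finite sample corrected singleton variable $\sum_j \exp(V_j/2) / \{1 + n^{-1/2} \sum_j \exp(V_j/2)\}$ with $V_j$ i.i.d. $\chi^2_1$, and inverts the distribution function $\exp(-\exp(-x/2))$ in closed form for the max-test. Whether the simulated variable matches the exact scaling of $B_S$ used in Table 1 is an assumption here.

```python
import numpy as np

def simulate_singleton_corrected(K, n, n_rep, rng):
    """Draws of S / (1 + S / sqrt(n)) with S = sum_j exp(V_j / 2), V_j iid chi^2_1."""
    S = np.exp(rng.chisquare(1, size=(n_rep, K)) / 2.0).sum(axis=1)
    return S / (1.0 + S / np.sqrt(n))

def max_test_critical_value(alpha):
    """Upper-alpha point x solving exp(-exp(-x/2)) = 1 - alpha."""
    return -2.0 * np.log(-np.log(1.0 - alpha))

# Hypothetical usage with the replication count quoted in the text.
rng = np.random.default_rng(7)
draws = simulate_singleton_corrected(K=10, n=100, n_rep=30000, rng=rng)
crit_singleton = np.quantile(draws, [0.90, 0.95, 0.99])
crit_max = max_test_critical_value(0.05)
```

The simulated variable is bounded above by $\sqrt{n}$, which explains why upper quantiles of such a statistic crowd together near $\sqrt{n}$ as $K$ grows.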

Table 2 shows simulated type I error probabilities for all omnibus tests 
based on a simulation of size 5000 with K = 10. The results are very simi- 
lar for K = 20. As mentioned in Fan and Huang (2001), the approximation 



Table 1
Simulated critical points of limiting null distributions

Test    K         α = 0.10    α = 0.05    α = 0.01
$L_a$   10          9.393      13.521      21.028
$L_a$   20          9.985      14.871      28.103
$L_b$   10, 20      3.460       5.620      10.832
$B_N$   10, 20      3.728       5.105       8.149
$B_S$   10          8.170       8.724       9.598
$B_S$   20          9.027       9.339       9.795


Table 2
Simulated type I error probabilities for K = 10

Test    α = 0.10    α = 0.05    α = 0.01
$L_a$     0.100       0.063       0.019
$L_b$     0.102       0.050       0.010
$B_N$     0.094       0.052       0.010
$B_S$     0.109       0.055       0.012
$M_S$     0.079       0.036       0.006
$N_a$     0.125       0.069       0.017

Fig. 1. Alternative models: upper row, single effect alternatives (9); lower row, nested effect alternatives (10), for m = 1, . . . , 10.



$\exp(-\exp(-x))$ is not so good. This was confirmed in our simulations. The
simulated type I error probabilities of the adaptive Neyman test, based on
the simulated critical points of Table 1 in Fan and Huang (2001), are
considerably better (see last line in Table 2). The true levels of most tests
are close to the nominal levels. The max-test is somewhat conservative, whereas
the adaptive Neyman and the $L_a$ test are slightly liberal.
To examine power we consider two types of alternatives: 

(9) 'y^{x) = Umix) 

and 

1 m 

with $m$ ranging from 1 to 10. These alternative models are ordered in the
sense that they incorporate higher frequency terms as $m$ increases: as single
effects for the alternatives (9) and as nested effects for the alternatives
(10) (see Figure 1).

In Figure 2, power results are shown for 1000 data sets generated from the
alternative models (9) and (10), respectively. In all cases, the sample size
$n$ equals 100 and the level of significance is equal to 0.05. For all omnibus
tests, critical points were calculated using the 5000 simulated data sets under
the null hypothesis, and, hence, each omnibus test has true level very close to
0.05.

[Fig. 2. Power results; left panels K = 10, right panels K = 20.]

Focusing on the upper panels (single effect alternatives), four tests
essentially show constant power: the Oracle test, next the singleton test
$B_S$ and max-test $M_S$ with almost identical curves (as expected from
Theorem 5), and the full model test. When increasing the value of $K$ from 10
to 20 (from left to right panel), the power decreases somewhat, especially for
the full model test. The power characteristics of $L_b$ and $B_N$ are
comparable (with some advantage for $L_b$): they have the highest power for
the first lower frequency terms but their power drops rapidly, with very
comparable values for both values of $K$. The adaptive Neyman test also has a
decreasing trend, but with strikingly higher powers for even alternatives.
This is related to the fact that the cosine based Fourier transform terms enter
the sum in the test statistic first, alternating with the sine terms. Finally,
the only test with an increasing power curve is the $L_a$ test. For the single
effect alternatives, the Bayesian-motivated test $B_S$ is clearly the best
choice.

For the nested effect alternatives (lower panels in Figure 2), only the full
model test has seemingly constant power behavior; but the higher the value
of $K$ (making the test more omnibus), the less competitive this parametric
approach becomes. The singleton test $B_S$ and the max-test $M_S$ are again
very close and somewhat comparable to the adaptive Neyman test $N_a$. But
their overall performance is rather poor. The best choices, especially for $K$
large and for the alternatives (10) with $m < 7$, are the Bayesian-motivated
test $B_N$ and the $L_b$ test. As for the single effect alternatives, the
$L_a$ test seems to be a good choice for (very) high frequencies.

No single omnibus test is superior for all types of alternatives. This general 
statement, which is accepted as a sort of consensus by many statisticians, 
is confirmed by this (small) simulation study. It also shows the importance 
of additional knowledge, from experts in the application area, about the 
plausibility of certain types of alternatives. 

3.2. Analysis of data from a variable star. Astronomers, both professional
and amateur, have collected masses of data on variable stars [Mattei
(1997)]. The length of time between consecutive maximum brightnesses of
a star is an important quantity to astronomers since it contains information
about the age and other properties of the star. We shall refer to these lengths
of time as "pseudo-periods," since they tend to fluctuate substantially about
the star's actual period, which is determined by fitting a periodic function
to observations. Of particular interest is detecting systematic changes, or
trends, in a star's period [Koen and Lombard (2001)].

Here we will apply the methodology introduced in this paper to test for
period changes in the long-period variable Omicron Ceti, or Mira. Both
a frequentist and a "proper" Bayesian analysis of the data will be done.
The data are $(j, Y_j)$, $j = 1, \ldots, 76$, where $Y_j$ is the observed time
(in days) between the $(j - 1)$st and $j$th maxima on Mira's light curve. The
light curve is simply Mira's brightness as a function of time. A plot of the
observed pseudo-periods is given in Figure 3.

Note that we may treat $Y_1, Y_2, \ldots$ as a time series, although the index
$j$ is not actually time. A model often used by astronomers is as follows:

$$Y_j = P + A_j + I_j + \varepsilon_j - \varepsilon_{j-1}, \qquad j = 1, 2, \ldots,$$

where $P$ is the mean period of the star, $A_j$ is a systematic deviation from
the mean period, $I_j$ represents random variation intrinsic to the star, and
$\varepsilon_j$ is the error made in measuring the $j$th time of maximum
brightness.

A common set of assumptions is that the $\varepsilon_j$'s are i.i.d. with mean
0 and variance $\sigma_\varepsilon^2 < \infty$, the $I_j$'s are i.i.d. with
mean 0 and variance $\sigma_I^2 < \infty$, and the two series are independent
of each other. Our model generalizes two aspects of this one. First of all, we
allow for heteroscedasticity among the $\varepsilon_j$'s via the model

$$\mathrm{Var}(\varepsilon_j) = \exp(v_0 + v_1 j), \qquad j = 1, \ldots, 76.$$

This model is motivated by analysis of data from 378 variable stars by Hart,
Koen and Lombard (2004), which indicates a tendency for residual variance
to decrease over time, a not unexpected phenomenon since observation
methods have improved with time. A second difference in our model is that
we allow the $I_j$'s to follow a first order autoregressive [AR(1)] model, that
is,

$$I_j = \rho I_{j-1} + Z_j, \qquad j = 2, \ldots, 76,$$




Fig. 3. Mira pseudo-periods and two estimates of trend. The solid line is a sixth degree 
polynomial, the model chosen by BIC. The dashed line is a local linear smooth. 
where $|\rho| < 1$ and the $Z_j$'s are i.i.d. mean 0 random variables with
finite variance $\sigma_Z^2$. Our motivation for using an AR model is to
circumvent a false indication of trend. It is well known that the actual size
of a trend test assuming independent data is usually larger than the nominal
size when the data exhibit positive serial correlation.

We will model the trend $A_j$, $j = 1, \ldots, 76$, as a polynomial of unknown
degree, and take $K = 15$ as an upper bound on the degree. To obtain a
likelihood function, we assume that both the $\varepsilon_j$'s and $Z_j$'s are
Gaussian. Therefore, our complete model says that $Y_1, \ldots, Y_{76}$ are
jointly normal with means of the form

$$E(Y_j) = \beta_0 + \beta_1 j + \cdots + \beta_k j^k, \qquad j = 1, \ldots, 76,$$

and covariance matrix defined by

$$\mathrm{Cov}(Y_i, Y_j) = \begin{cases} \sigma_Z^2/(1 - \rho^2) + \exp[v_0 + v_1 j] + \exp[v_0 + v_1 (j - 1)], & i = j, \\ \rho\,\sigma_Z^2/(1 - \rho^2) - \exp[v_0 + v_1 \min(i, j)], & |i - j| = 1, \\ \rho^{|i-j|}\,\sigma_Z^2/(1 - \rho^2), & |i - j| > 1. \end{cases}$$
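
The covariance structure can be checked numerically. The sketch below builds the matrix entry by entry from the three cases (stationary AR(1) intrinsic variation plus differenced heteroscedastic measurement errors) and, separately, from first principles as $D \Sigma_\varepsilon D^T$ plus the AR(1) covariance, where $D$ is the differencing operator. The function names and the parameter values in the usage lines are hypothetical.

```python
import numpy as np

def pseudo_period_covariance(n, rho, sigma2_z, v0, v1):
    """Covariance of Y_1..Y_n with AR(1) intrinsic noise and differenced errors."""
    idx = np.arange(1, n + 1)
    lag = np.abs(idx[:, None] - idx[None, :])
    cov = rho ** lag * sigma2_z / (1.0 - rho ** 2)      # AR(1) part, all lags

    def var_eps(j):
        return np.exp(v0 + v1 * j)

    cov[np.diag_indices(n)] += var_eps(idx) + var_eps(idx - 1)
    off = np.arange(n - 1)
    cov[off, off + 1] -= var_eps(idx[:-1])               # min(i, i+1) = i
    cov[off + 1, off] -= var_eps(idx[:-1])
    return cov

def covariance_from_first_principles(n, rho, sigma2_z, v0, v1):
    """Same matrix via D Sigma_eps D' + AR(1) covariance, as a cross-check."""
    D = np.zeros((n, n + 1))
    rows = np.arange(n)
    D[rows, rows + 1] = 1.0                              # + eps_j
    D[rows, rows] = -1.0                                 # - eps_{j-1}
    sigma_eps = np.diag(np.exp(v0 + v1 * np.arange(n + 1)))
    lag = np.abs(rows[:, None] - rows[None, :])
    return D @ sigma_eps @ D.T + rho ** lag * sigma2_z / (1.0 - rho ** 2)

# Hypothetical parameter values, chosen only to exercise the code.
C1 = pseudo_period_covariance(76, 0.4, 2.0, 1.0, -0.002)
C2 = covariance_from_first_principles(76, 0.4, 2.0, 1.0, -0.002)
```

The agreement of the two constructions confirms that the negative term at lag one comes entirely from the shared measurement error $\varepsilon_{\min(i,j)}$ in consecutive pseudo-periods.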

We wish to test the hypothesis

$$H_0: A_1 = A_2 = \cdots = A_n = 0.$$

In our frequentist analysis, two test statistics were computed. One is
$\pi_{\mathrm{BIC}}$ for the nested polynomial models with degrees
$0, 1, \ldots, 15$, and the other is

$$\pi_{\mathrm{singleton}} = \Big(1 + \sum_{j=1}^{15} \exp[\log(L_j/L_{j-1}) - \log(76)/2]\Big)^{-1},$$

where $L_j$ is the maximized likelihood for the degree $j$ polynomial model.
The components $L_j/L_{j-1}$, $j = 1, \ldots, 15$, are approximately
independent of each other, with the $j$th component representing the relative
increase in likelihood when stepping from a $(j - 1)$st to a $j$th degree
polynomial.
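
Given the maximized log-likelihoods of the nested polynomial fits, the singleton-type statistic is a one-line computation. The sketch below assumes a hypothetical array `loglik` holding $\log L_0, \ldots, \log L_{15}$; consecutive differences give $\log(L_j/L_{j-1})$, and each increment is penalized by $\log(n)/2$ with $n = 76$.

```python
import numpy as np

def pi_singleton(loglik, n):
    """{1 + sum_j exp[log(L_j / L_{j-1}) - log(n)/2]}^{-1} from nested fits.

    loglik[j] is the maximized log-likelihood of the degree-j model, j = 0..K.
    """
    loglik = np.asarray(loglik, dtype=float)
    increments = np.diff(loglik)                 # log(L_j / L_{j-1}), j = 1..K
    return float(1.0 / (1.0 + np.sum(np.exp(increments - 0.5 * np.log(n)))))

# Hypothetical usage: no degree improves the fit at all.
flat = pi_singleton(np.zeros(16), n=76)
```

When no degree improves the likelihood, the statistic equals $1/(1 + 15/\sqrt{76})$, its largest possible value, since each increment of a nested sequence is nonnegative.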

The values of $\pi_{\mathrm{BIC}}$ and $\pi_{\mathrm{singleton}}$ for the Mira
data were 0.000161 and 0.00265, respectively. Inasmuch as these quantities are
good approximations to posterior probabilities of no trend, this is already
considerable evidence in favor of a trend. However, we may also use frequentist
methods to judge the significance of these values. A parametric bootstrap was
used to approximate the distribution of the two statistics on the assumption
that $H_0$ is true. Data were generated from the estimated error model
corresponding to the polynomial degree maximizing $\mathrm{BIC}_j$,
$j = 0, 1, \ldots, 15$. The estimated optimal degree was 6, and the maximum
likelihood estimate of $\sigma_Z^2/\exp(v_0)$ at degree 6 was 0. Essentially,
this says that the experimental errors, $\varepsilon_j - \varepsilon_{j-1}$,
are estimated to be so large that they completely overwhelm the intrinsic
errors,
Fig. 4. Approximations to the distributions of $\pi_{\mathrm{BIC}}$ (right) and
$\pi_{\mathrm{singleton}}$ (left). The solid lines are obtained from a Gaussian
bootstrap, and the dashed lines are asymptotic distributions.



$I_j$. The maximum likelihood estimate of $v_1$ at degree 6 was $-0.001816$. In
our bootstrap procedure, we thus generated observations $Y_j^*$ according to

$$Y_j^* = \varepsilon_j^* - \varepsilon_{j-1}^*, \qquad j = 1, \ldots, 76,$$

where the $\varepsilon_j^*$'s are independent with
$\varepsilon_j^* \sim N(0, \exp(-0.001816\,j))$, $j = 0, 1, \ldots, 76$.
(Since the distributions of our likelihood ratios are invariant to a constant
mean and to $v_0$, we took these two parameters to be 0.)
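
The data-generating step of this parametric bootstrap can be sketched as follows: draw independent normal errors with variance $\exp(v_1 j)$ for $j = 0, \ldots, 76$ and difference them. The function name is hypothetical, and the sketch covers only the generation of bootstrap samples, not the refitting of the polynomial models on each sample.

```python
import numpy as np

def bootstrap_pseudo_periods(n, v1, n_boot, rng):
    """Generate n_boot series Y*_j = eps*_j - eps*_{j-1}, j = 1..n.

    eps*_j ~ N(0, exp(v1 * j)) independently for j = 0..n (v0 and mean set to 0).
    """
    j = np.arange(0, n + 1)
    sd = np.exp(0.5 * v1 * j)                       # standard deviations
    eps = rng.normal(size=(n_boot, n + 1)) * sd     # row-wise independent errors
    return eps[:, 1:] - eps[:, :-1]

# Usage with the estimate quoted in the text and one thousand bootstrap sets.
rng = np.random.default_rng(123)
Y_star = bootstrap_pseudo_periods(n=76, v1=-0.001816, n_boot=1000, rng=rng)
```

Each row of `Y_star` is one bootstrap series; consecutive entries within a row are negatively correlated because they share one measurement error.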

One thousand sets of bootstrap data were generated, and on each one
we computed $\pi^*_{\mathrm{BIC}}$ and $\pi^*_{\mathrm{singleton}}$ in exactly
the same way that $\pi_{\mathrm{BIC}}$ and $\pi_{\mathrm{singleton}}$ were
computed from the original data. Kernel density estimates for the two bootstrap
distributions are shown in Figure 4. In addition, we provide estimates of the
densities of $\pi_{\mathrm{BIC,asy}}$ and $\pi_{\mathrm{singleton,asy}}$, where

$$\pi_{\mathrm{BIC,asy}} = \left(1 + \sum_{j=1}^{15} \exp\left[\sum_{i=1}^{j} V_i/2 - \log(76)\,j/2\right]\right)^{-1},$$

$$\pi_{\mathrm{singleton,asy}} = \left(1 + \sum_{j=1}^{15} \exp[V_j/2 - \log(76)/2]\right)^{-1}$$

and $V_1, \ldots, V_{15}$ are i.i.d. $\chi^2_1$ random variables. The two
latter distributions are large sample approximations to the null distributions
of the two statistics.
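Draws from the two large sample approximations can be simulated directly from
chi-squared variates. The sketch below is illustrative only: the function name
is hypothetical, and using the same $V_1, \ldots, V_{15}$ for both statistics
is a convenience assumed here, not something stated in the text.

```python
import math
import random

N = 76        # sample size in the Mira example
MAX_DEG = 15  # largest polynomial degree considered


def draw_pi_asy(rng):
    """Return one draw of (pi_BIC_asy, pi_singleton_asy)."""
    # V_1, ..., V_15: squares of standard normals, i.e. chi-squared(1)
    v = [rng.gauss(0.0, 1.0) ** 2 for _ in range(MAX_DEG)]
    half_log_n = math.log(N) / 2.0
    cum = 0.0          # running sum V_1 + ... + V_j (nested models)
    nested_sum = 0.0
    single_sum = 0.0
    for j in range(1, MAX_DEG + 1):
        cum += v[j - 1]
        nested_sum += math.exp(cum / 2.0 - half_log_n * j)
        single_sum += math.exp(v[j - 1] / 2.0 - half_log_n)
    return 1.0 / (1.0 + nested_sum), 1.0 / (1.0 + single_sum)


rng = random.Random(0)  # hypothetical seed
draws = [draw_pi_asy(rng) for _ in range(1000)]
```

A kernel density estimate of either coordinate of `draws` would play the role
of the dashed curves described for Figure 4.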

The two approximations to the distribution of $\pi_{\mathrm{BIC}}$ are in close
agreement, while those for $\pi_{\mathrm{singleton}}$ differ somewhat. The
bootstrap distribution has a heavier left tail than the large sample
approximation. Estimated P-values for $\pi_{\mathrm{BIC}}$ and
$\pi_{\mathrm{singleton}}$ are 0 and 1/2000, respectively, these being based on
the two bootstrap distributions. So, the frequentist analysis provides strong
evidence of a trend in the Mira pseudo-periods. Estimates of trend are seen
in Figure 3.

We now describe a Bayesian analysis of the data. Priors for all model
parameters were determined empirically by fitting distributions to maximum
likelihood estimates for a database of 378 stars, one of which is Mira. The
prior for the polynomial degree $k$ is of particular importance since the prior
probability of the null hypothesis is simply the prior probability of $k = 0$.
We considered three different priors for $k$: uniform over $0, 1, \ldots, 15$,

$$\pi_1(k) = \frac{1}{3.381(k+1)}, \qquad k = 0, 1, \ldots, 15,$$

and

$$\pi_2(k) = \begin{cases} 0.5, & k = 0, \\ [2(2.381)(k+1)]^{-1}, & k = 1, \ldots, 15. \end{cases}$$

The prior $\pi_1$ is a truncated version of Jeffreys' noninformative prior for
an unrestricted positive integer [Jeffreys (1961), page 238], while $\pi_2$ is
a modified version of $\pi_1$ that is "fair" to the null hypothesis, in that
$\pi_2(0) = 0.5$. Posterior probabilities of each polynomial degree were
approximated using a modification of Laplace's method that accounts for the
possibility that the MLE of $\sigma^2$ can occur at its lower boundary of 0.
The results are given in Table 3. Regardless of which prior is used for $k$,
the posterior probability of a trend is at least 0.978, and, hence, the
Bayesian analysis is in basic agreement with our earlier frequentist one.
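The constants 3.381 and 2.381 in the two non-uniform priors can be checked
numerically, assuming both priors are proportional to $1/(k+1)$ on their
supports. The short sketch below does this; the function names are
hypothetical.

```python
K_MAX = 15

# Sum of 1/(k+1) over k = 0, ..., 15 (the 16th harmonic number)
H16 = sum(1.0 / (k + 1) for k in range(K_MAX + 1))


def prior_1(k):
    """Truncated Jeffreys-type prior, proportional to 1/(k+1), k = 0..15."""
    return 1.0 / (H16 * (k + 1))


def prior_2(k):
    """Prior placing mass 0.5 on k = 0 and 0.5 spread as 1/(k+1) on k >= 1."""
    if k == 0:
        return 0.5
    return 1.0 / (2.0 * (H16 - 1.0) * (k + 1))


total_1 = sum(prior_1(k) for k in range(K_MAX + 1))
total_2 = sum(prior_2(k) for k in range(K_MAX + 1))
```

Rounded to three decimals, `H16` and `H16 - 1` reproduce the two constants, and
both priors sum to 1 up to floating-point error.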



4. Properties of frequentist tests. We now investigate asymptotic
frequentist properties of the test statistic $1 - \pi_{\mathrm{BIC}}$. We show
how the limiting distribution of this statistic depends on the class of models
$\mathcal{A}$, and we study the power of a version of the test based on
singleton models (Section 2.1). It will be shown that the "singleton" test can
detect local alternatives tending to the null at rate
$\sqrt{\log n}/\sqrt{n}$, and that its limiting power is completely
determined by the largest Fourier coefficient of the true function. Proofs of
all theorems are provided in the Appendix.

Table 3
Approximations to posterior probabilities of polynomial degrees k for Mira
data. The first row is obtained using the classical BIC approximation to
posterior probabilities, while the other three are based on a proper Bayesian
analysis with different priors for k, as explained in the text.

Prior                                        k
for k       0      1      2      3      4      5      6      7      8      9    >=10
BIC       0.000  0.000  0.000  0.000  0.002  0.320  0.489  0.059  0.095  0.030  0.005
Uniform   0.001  0.003  0.001  0.000  0.002  0.107  0.335  0.109  0.189  0.129  0.124
$\pi_1$   0.009  0.011  0.003  0.001  0.003  0.141  0.377  0.108  0.166  0.102  0.079
$\pi_2$   0.022  0.011  0.003  0.001  0.003  0.139  0.372  0.106  0.163  0.100  0.080

4.1. Limiting distribution under the null hypothesis. Our first two
theorems are quite general in the sense that we only make assumptions about
the limiting behavior of the log-likelihood ratios
$\mathcal{L}_j = 2\log(L_j/L_0)$. These assumptions hold for a great variety of
likelihood models. In the sequel, $\chi^2_k$ denotes a random variable having
the chi-squared distribution with $k$ degrees of freedom.

The effect of $\mathcal{A}$ is well illustrated in our first theorem, in which
$\mathcal{A}$ contains finitely many models.

Theorem 1. Let $\mathcal{A}$ be a set containing only a finite number of
different models, $M_1, \ldots, M_K$, all including the null model $M_0$ as a
special case. Denote by $m$ the minimal set size
$m = \min_{1 \le j \le K}(|M_j|)$, where $|M|$ is the dimension of model $M$,
and define

$$\mathcal{K}_m = \{j \in \{1, \ldots, K\} : |M_j| = m\} = \{m(1), m(2), \ldots, m(\tilde m)\}.$$

We assume the following conditions hold:

(i) For $j = 1, \ldots, K$, the log-likelihood ratio $\mathcal{L}_j$ is bounded
in probability as $n \to \infty$.

(ii) $(\mathcal{L}_{m(1)}, \ldots, \mathcal{L}_{m(\tilde m)}) \to_d (V_1, \ldots, V_{\tilde m})$,
where $V_1, \ldots, V_{\tilde m}$ are jointly distributed random variables each
having the $\chi^2_{m - m_0}$ distribution.

It then follows that

$$n^{(m - m_0)/2}(1 - \pi_{\mathrm{BIC}}) \to_d \sum_{j=1}^{\tilde m} \exp(\tfrac{1}{2} V_j).$$

Perhaps the most important aspect of Theorem 1 is the fact that the
limit distribution of $1 - \pi_{\mathrm{BIC}}$ is completely determined by the
models in $\mathcal{A}$ with the fewest parameters. In the special case where
$\mathcal{A}$ is a finite sequence of nested models, Theorem 1 implies that
$n^{(m - m_0)/2}(1 - \pi_{\mathrm{BIC}})$ converges in distribution to
$\exp(\tfrac{1}{2}\chi^2_{m - m_0})$, where $m$ is the number of parameters in
the smallest model in $\mathcal{A}$. This "fewest parameters" phenomenon can
also be seen in the BIC-based goodness-of-fit test proposed by Ledwina (1994),
and is a result of the fact that BIC consistently chooses the null model when
$H_0$ is true. For more discussion on the phenomenon, see Claeskens and Hjort
(2004).
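The "fewest parameters" effect can be seen numerically in the nested case. The
sketch below is an illustration under assumptions made here: the nested models
satisfy $m_j - m_0 = j$, the log-likelihood ratios are replaced by cumulative
sums of fixed chi-squared-type values, and the function name and the example
values are hypothetical.

```python
import math


def scaled_one_minus_pi(v, n):
    """Return sqrt(n) * (1 - pi) for nested models with m_j - m_0 = j.

    v[j-1] plays the role of the j-th chi-squared(1) increment, so that
    2*log(L_j/L_0) is approximated by v[0] + ... + v[j-1].
    """
    total, cum = 0.0, 0.0
    for j, vj in enumerate(v, start=1):
        cum += vj
        total += math.exp(cum / 2.0 - j * math.log(n) / 2.0)
    return math.sqrt(n) * total / (1.0 + total)


v_example = [1.2, 0.5, 3.0]  # made-up values for illustration
value = scaled_one_minus_pi(v_example, n=1e10)
limit = math.exp(v_example[0] / 2.0)  # exp(V_1 / 2), smallest model only
```

For large `n` the scaled statistic is dominated by the term from the smallest
alternative model, so `value` is close to `limit`.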

Our next two theorems address cases in which the number of alternative
models tends to $\infty$ with $n$. Theorem 2 is essentially a corollary to
Theorem 1, and, hence, we do not provide its proof.

Theorem 2. Let $M_0, M_1, \ldots$ be a sequence of nested models containing
numbers of parameters $m_0 < m_1 < \cdots$, respectively. Assume that under
$H_0$ and as $n \to \infty$,

$$\mathcal{L}_1 \to_d \chi^2_{m_1 - m_0}.$$

Furthermore, assume that, as $n \to \infty$, $\mathcal{L}_j$ is bounded in
probability for each $j = 2, 3, \ldots$. Then there exists a sequence $\{K_n\}$
tending to infinity such that

$$n^{(m_1 - m_0)/2}\left[1 - \left(1 + \sum_{j=1}^{K_n} \exp(\mathrm{BIC}_j - \mathrm{BIC}_0)\right)^{-1}\right] \to_d \exp(\tfrac{1}{2}\chi^2_{m_1 - m_0})$$

as $n \to \infty$.



We now assume that the data follow a generalized linear model, as
discussed in Section 2.4. We study the case where
$\mathcal{A} = \mathcal{A}_r$ consists of the singleton models
$M_1, \ldots, M_K$ discussed at the beginning of Section 2, and we let $K$
tend to infinity with $n$. Theorems 1 and 2 show that the asymptotic null
distribution of $\pi_{\mathrm{BIC}}$ generally depends only on the models
having the smallest number of elements. Therefore, our next theorem is more
general than it first appears, since it also describes the limiting
distribution of $\pi_{\mathrm{BIC}}$ in many cases where the alternatives
consist of singletons plus other, larger models.

Define the statistic $\tilde S_n$ by

$$\tilde S_n = \sum_{j=1}^{K} \exp\left(\frac{n \hat a_j^2}{2a(\hat\eta_0)}\right),$$

where

$$\hat a_j = \frac{1}{n}\sum_{i=1}^{n} [Y_i - b'(g(x_i; \hat\theta_0))]\hat v_j(x_i), \qquad j = 1, \ldots, K.$$

From the definition of $\pi_{\mathrm{BIC}}$, we have

$$\sqrt{n}(1 - \pi_{\mathrm{BIC}}) = \frac{S_n}{1 + S_n/\sqrt{n}},$$

where $S_n = \sum_{j=1}^{K} \exp(\mathcal{L}_j/2)$. The statistic $\tilde S_n$
is to $S_n$ as a score statistic is to a likelihood ratio statistic. The
quantity $n\hat a_j^2/a(\hat\eta_0)$ is known to have the same limiting
distribution as the log-likelihood ratio $\mathcal{L}_j$ under the null
hypothesis and general regularity conditions, which suggests that under
general conditions the limiting distribution of
$\sqrt{n}(1 - \pi_{\mathrm{BIC}})$ is the same as that of $\tilde S_n$. In
order to simplify matters by having an explicit expression for the test
statistic, we thus state Theorem 3 in terms of $\tilde S_n$.
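The algebraic link between the posterior probability approximation and the sum
of exponentiated log-likelihood ratios can be checked on made-up numbers. The
sketch assumes singleton alternatives with one extra parameter each, so that
each model contributes $n^{-1/2}\exp(\mathcal{L}_j/2)$; the function name and
input values are hypothetical.

```python
import math


def pi_bic_from_ratios(log_lik_ratios, n):
    """Return (pi_BIC, S_n) given values of 2*log(L_j/L_0), j = 1, ..., K."""
    s_n = sum(math.exp(l / 2.0) for l in log_lik_ratios)
    pi_bic = 1.0 / (1.0 + s_n / math.sqrt(n))
    return pi_bic, s_n


n = 100
pi_bic, s_n = pi_bic_from_ratios([0.3, 2.1, 0.8], n)  # made-up ratios
lhs = math.sqrt(n) * (1.0 - pi_bic)
rhs = s_n / (1.0 + s_n / math.sqrt(n))
```

Here `lhs` and `rhs` agree up to rounding, and both approach `s_n` as `n`
grows with the ratios held fixed.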



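Theorem 3. Define the constants

$$a_K = \frac{\sqrt{\pi}\,K}{2\sqrt{\log K}} \quad\text{and}\quad b_K = \frac{K a_K}{\sqrt{\pi}} \int_1^\infty \frac{\sin(x/a_K)}{x^2\sqrt{\log x}}\,dx, \qquad K = 1, 2, \ldots.$$

Under assumptions A1-A8 in the Appendix,

$$\frac{\tilde S_n - b_K}{a_K} \to_d S$$

as $n$ and $K$ tend to infinity, where $S$ has the stable distribution
$S_1(1, 1, 0)$, in the notation of Samorodnitsky and Taqqu (1994).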



The most interesting aspect of Theorem 3 is that the limiting distribution
of $\tilde S_n$ is not normal. This results from the fact that each term
$\exp(n\hat a_j^2/[2a(\hat\eta_0)])$ converges in distribution to
$\exp(\chi^2_1/2)$. Now, $\exp(\chi^2_1/2)$ does not have a finite first
moment, and, hence, the classic central limit theorem does not apply to
$\tilde S_n$. However, the distribution of $\exp(\chi^2_1/2)$ is in the
domain of attraction of the stable distribution $S_1(1, 1, 0)$, as is easily
verified by checking the conditions of Theorem 1.8.1 in Samorodnitsky and Taqqu
(1994).
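The heavy tail responsible for the non-normal limit can be examined directly.
For a $\chi^2_1$ variable $V$ and $x > 1$, the event $\exp(V/2) > x$ is the
event that a squared standard normal exceeds $2\log x$, which has probability
$\mathrm{erfc}(\sqrt{\log x})$. The sketch below compares this with the
leading-order approximation $1/(x\sqrt{\pi\log x})$ obtained from the standard
asymptotic expansion of erfc; the function names are hypothetical.

```python
import math


def tail_exact(x):
    """P(exp(V/2) > x) for V ~ chi-squared(1) and x > 1."""
    return math.erfc(math.sqrt(math.log(x)))


def tail_leading_order(x):
    """Leading-order tail, 1 / (x * sqrt(pi * log x))."""
    return 1.0 / (x * math.sqrt(math.pi * math.log(x)))


ratio = tail_exact(1e6) / tail_leading_order(1e6)
```

The tail decays like $1/x$ up to a slowly varying factor, so its integral over
$(1, \infty)$ diverges, which is the infinite first moment noted in the text.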

Some remarks on the size of $K$ are in order. Ideally, we would allow $K$
to be as large as $n - p$. However, our method of proving Theorem 3 allows
$K$ to be no larger than $o(n^{1/8})$. Further restrictions on $K$ may be
necessary depending on the choice of basis functions. The key assumptions in
this regard are A2 and A8. Suppose that $u_1, u_2, \ldots$ are trigonometric
functions or Walsh functions [Golubov, Efimov and Skvortsov (1991)]. Then the
bounds $B_K$ (in A2) are constant for every $K$, and no further restriction on
$K$ is required. If the dimension of the covariate is 1, and
$u_1, u_2, \ldots$ are Legendre polynomials, then
$B_K = (\text{constant})K^{1/2}$ [Szegő (1975), pages 68 and 184]
and, again, no further restriction is needed. It is also worth mentioning
that the only assumption among A1-A8 affected by the dimensionality of
the covariate $x$ is A2. The bounds $B_1, B_2, \ldots$ will, in some cases,
tend to increase with the dimensionality of the covariate. For example, if one
uses products of Legendre polynomials as basis functions, then $B_K$ will be of
order $K^{d/2}$, where $d$ is the dimensionality of $x$. This, of course, will
further reduce the allowable size of $K$.

If the practitioner feels it necessary to choose a rather large value of K, 
and is concerned about using the large sample distribution of Theorem 3, 
then bootstrap methods could be used to approximate the distribution of 
the test statistic. 



4.2. Power against local alternatives. Here we consider power against
local alternatives, that is, alternatives that tend to the null hypothesis as
$n \to \infty$. We provide rates and constants for local alternatives such
that a test based on $\tilde S_n$ has power tending to 1 and another rate (and
constants) such that the power tends to $p$, $\alpha < p < 1$.

Theorem 4. Let assumptions A1-A8 in the Appendix hold, and assume
that the function $g$ in our generalized linear model (GLM) has the form

$$g_n(x) = g(x; \theta_0) + \left(\frac{\gamma_1 + \gamma_2\sqrt{2\log a_K}}{\sqrt{n}}\right)\sum_{j=1}^{m} \phi_j v_j(x),$$

where $-\infty < \gamma_1 < \infty$ and $\gamma_2 > 0$ are constants. We
assume that one of $|\phi_j|$ is strictly larger than all others, and define
$\zeta = \max_{1 \le j \le m} |\phi_j|/\sqrt{a(\eta_0)}$, where $a(\eta_0)$ is
the dispersion parameter in the GLM. Letting $s_\alpha$ be the $(1 - \alpha)$
quantile of the stable distribution $S_1(1, 1, 0)$ and $\Phi$ the c.d.f. of
the standard normal distribution, it follows that

$$\lim_{n \to \infty} P\left(\frac{\tilde S_n - b_K}{a_K} > s_\alpha\right) = \begin{cases} \alpha + (1 - \alpha)\Phi(\gamma_1\zeta), & \gamma_2 = 1/\zeta, \\ 1, & \gamma_2 > 1/\zeta. \end{cases}$$
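The limiting power in Theorem 4 is simple to evaluate. The sketch below covers
only the two cases displayed in the theorem, $\gamma_2 = 1/\zeta$ and
$\gamma_2 > 1/\zeta$; the function name, the tolerance used to detect equality,
and the example arguments are assumptions made for illustration.

```python
from statistics import NormalDist


def limiting_power(gamma1, gamma2, zeta, alpha=0.05, tol=1e-12):
    """Limiting power for the two cases displayed in Theorem 4."""
    if abs(gamma2 * zeta - 1.0) <= tol:   # gamma2 = 1/zeta
        return alpha + (1.0 - alpha) * NormalDist().cdf(gamma1 * zeta)
    if gamma2 * zeta > 1.0:               # gamma2 > 1/zeta
        return 1.0
    raise ValueError("gamma2 < 1/zeta is not among the displayed cases")


power_boundary = limiting_power(gamma1=0.0, gamma2=0.5, zeta=2.0)
power_above = limiting_power(gamma1=1.0, gamma2=1.0, zeta=2.0)
```

On the boundary $\gamma_2 = 1/\zeta$ the power moves from $\alpha$ toward 1 as
$\gamma_1$ increases, which is the role played by $\Phi(\gamma_1\zeta)$.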



It is important to note that the limiting power of the $\tilde S_n$-based test
is determined by the largest Fourier coefficient of the true function. In
contrast, the power of tests based only on nested alternatives is largely
determined by the coefficients of the smallest alternative models, regardless
of whether those coefficients are the largest ones. [See, e.g., Aerts,
Claeskens and Hart (2000).] For this reason tests based on nested alternatives
often have poor power against high frequency alternatives, since lower
frequency models are the default "simplest" models. Owing to the nature of
$\tilde S_n$'s null distribution, it is not too surprising that the power for
$\tilde S_n$ is determined by the largest Fourier coefficient. LePage,
Woodroofe and Zinn (1981) show explicitly that the limit of sums converging to
a stable law is determined by the few largest summands.

The connection between $\tilde S_n$ and the largest sample Fourier coefficient
becomes even clearer in the next theorem. We consider the test that rejects
$H_0$ for large values of

(11)  $$R_n = \max_{1 \le j \le K} \frac{n\hat a_j^2}{a(\hat\eta_0)}$$

and show that its limiting power against the local alternatives of Theorem 4
matches that of $\tilde S_n$. Since $R_n$ is undoubtedly more familiar to most
readers than is $\tilde S_n$, this result provides a sort of benchmark for
understanding the power properties of $\tilde S_n$.

Theorem 5. Let $R_n$ be the statistic defined in (11), and suppose that
A1-A8 hold. Then if $H_0$ is true,

$$\lim_{n, K \to \infty} P(R_n - 2\log K + \log\log K + \log\pi \le x) = \exp(-\exp(-x/2)) \quad \text{for each } x.$$

Now define $x_\alpha = -2\log\log(1 - \alpha)^{-1}$, the $1 - \alpha$ quantile
of the distribution $\exp(-\exp(-x/2))$. When the local alternatives of
Theorem 4 hold,

$$\lim_{n, K \to \infty} P(R_n - 2\log K + \log\log K + \log\pi > x_\alpha) = \lim_{n, K \to \infty} P\left(\frac{\tilde S_n - b_K}{a_K} > s_\alpha\right),$$

where the latter limit is given in Theorem 4.
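The critical value implied by the limiting distribution in Theorem 5 can be
computed in closed form. The following sketch inverts
$\exp(-\exp(-x/2)) = 1 - \alpha$ and applies the centering
$2\log K - \log\log K - \log\pi$; the function names and the example inputs are
hypothetical, and the centering requires $K \ge 2$ so that $\log\log K$ is
defined.

```python
import math


def x_alpha(alpha):
    """Solve exp(-exp(-x/2)) = 1 - alpha for x."""
    return -2.0 * math.log(-math.log(1.0 - alpha))


def max_test_rejects(r_n, k, alpha=0.05):
    """Reject H0 when R_n - 2 log K + log log K + log(pi) exceeds x_alpha."""
    centered = (r_n - 2.0 * math.log(k)
                + math.log(math.log(k)) + math.log(math.pi))
    return centered > x_alpha(alpha)


x05 = x_alpha(0.05)
check = math.exp(-math.exp(-x05 / 2.0))  # should equal 1 - 0.05
```

The large sample rule uses only the maximum standardized squared coefficient
$R_n$ and the number of singleton models $K$.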

4.3. Lindley's paradox. Lindley's paradox refers to situations where the
posterior probability of a hypothesis, $H_0$, is very high, say 0.95, and yet a
frequentist test indicates strong evidence against $H_0$, in that the P-value
for $H_0$ is small, say 0.01. Typically a frequentist does not have to deal
with Lindley's paradox since he or she does not compute posterior
probabilities. However, a frequentist using the tests proposed in Section 3
cannot help but notice it since the test statistic itself is a posterior
probability. A level $\alpha$ test has the form

(12)  "Reject $H_0$ if $\pi_n \le p_{n,\alpha}$,"

with $p_{n,\alpha} \to 1$ as $n \to \infty$. This implies that for large
enough $n$, a posterior probability (for $H_0$) of, for example, 0.99, would
lead to rejection of $H_0$!

For frequentists concerned with Lindley's paradox, a relevant question is
"at what sample size does the paradox begin to manifest itself?" It seems
reasonable to say that the paradox occurs only if we reject $H_0$ when $H_0$ is
a posteriori more probable than $H_a$. Therefore, we may ask at what sample
size does the critical value of test (12) become larger than 1/2? The BIC
approximation to the posterior probability of $H_0$ is

$$\pi_{\mathrm{BIC}} = \frac{1}{1 + \sum_{k=1}^{K} n^{-(m_k - m_0)/2} L_k/L_0}.$$

Let us assume that $m_k - m_0 = 1$ for each $k$ and that $L_k/L_0$,
$k = 1, \ldots, K$, are asymptotically independent, as is true for the
singleton models used in Theorem 3. When $H_0$ is true, the distribution of
$\pi_{\mathrm{BIC}}$ is thus approximated by that of

(13)  $$\left[1 + n^{-1/2}\sum_{k=1}^{K} \exp(V_k/2)\right]^{-1},$$

where $V_1, \ldots, V_K$ are i.i.d. $\chi^2_1$ random variables. Consider the
test that rejects $H_0$ at sample size $n$ and nominal level $\alpha$ when
$\pi_{\mathrm{BIC}}$ is no more than $p_{n,K,\alpha}$, the $\alpha$ quantile of
the distribution of (13). For purposes of discussion, we will say that
Lindley's paradox occurs when $p_{n,K,\alpha} > 1/2$. Of course,
$\pi_{\mathrm{BIC}}$ is only an approximation to the posterior probability of
$H_0$, but Kass and Wasserman (1995) provide evidence that
$\pi_{\mathrm{BIC}}$ is an excellent approximation to $\pi_n$ for certain
reference priors. To be on the safe side, we could say that
$1/2 < \pi_{\mathrm{BIC}} \le p_{n,K,\alpha}$ is an example of Lindley's
paradox in cases where one is using the appropriate reference priors.

Figure 5 displays approximations of the 95th percentile of the distribution
of (13) as a function of $\sqrt{n}$ and for different values of $K$. The
approximations were obtained by generating 10,000 independent values of (13).
The graph indicates that for a test of nominal level 0.05, Lindley's paradox is
a very large sample phenomenon when using a $K$ of 10 or more. For $K = 10$,
values greater than 0.5 are not included in the rejection region until $n$ is
more than 6000. On the other hand, the paradox can occur for $K = 1$ when $n$
is as small as 64. The case $K = 1$ is of particular interest since then the
distribution of (13) approximates that of our statistic for testing $H_0$
against a sequence of nested alternatives.

Fig. 5. Approximate 95th percentiles of (13). From top to bottom, the curves
correspond to K = 1, 5, 10, 20, respectively.
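The critical values discussed here can be approximated with a short
simulation. The sketch below assumes that the quantity labeled (13) has the
form $[1 + n^{-1/2}\sum_{k=1}^{K}\exp(V_k/2)]^{-1}$ with i.i.d. $\chi^2_1$
variables $V_k$, and it computes the $\alpha$ quantile used in the definition
of $p_{n,K,\alpha}$. For $K = 1$ the quantile is available exactly, which gives
a check on the Monte Carlo version. Function names, seeds and replication
counts are hypothetical.

```python
import math
import random
from statistics import NormalDist


def critical_value_k1(n, alpha=0.05):
    """Exact alpha quantile of [1 + exp(V/2)/sqrt(n)]^(-1), V ~ chi2(1)."""
    # Small values of the statistic correspond to large V, so the alpha
    # quantile of the statistic uses the 1 - alpha quantile of V.
    v_upper = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    return 1.0 / (1.0 + math.exp(v_upper / 2.0) / math.sqrt(n))


def critical_value_mc(n, k, alpha=0.05, reps=10000, seed=0):
    """Monte Carlo alpha quantile of the same statistic for general K."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        s = sum(math.exp(rng.gauss(0.0, 1.0) ** 2 / 2.0) for _ in range(k))
        draws.append(1.0 / (1.0 + s / math.sqrt(n)))
    draws.sort()
    return draws[int(alpha * reps)]


p_k1 = critical_value_k1(64)
p_k10 = critical_value_mc(64, 10)
```

With $n = 64$ the exact critical value for $K = 1$ exceeds 1/2, while for
$K = 10$ every simulated value of the statistic is below 1/2 because each term
$\exp(V_k/2)$ is at least 1.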

A way of resolving Lindley's paradox is to use a test of the form

(14)  "Reject $H_0$ if $\pi_n \le \min(p_{n,\alpha}, 1/2)$."

What effect does such a rejection region have on the power and level of the
test? Typically, for all $n$ less than some $n_0$, test (14) will be identical
to (12). For larger $n$, test (14) will have level of significance smaller than
$\alpha$ and, indeed, tending to 0 as $n \to \infty$. Of course, the smaller
rejection region will lead to an attendant reduction in power. However, in a
certain sense the reduction is quite small. It can be shown that test (14) has
power tending to 1 for both fixed alternatives and local alternatives tending
to the null at rate $(\log n)^{\eta}/\sqrt{n}$, where $\eta > 1/2$. For local
alternatives tending to 0 at rate $1/\sqrt{n}$, though, the power of (14) tends
to 0. Apparently, this is a price that must be paid to avoid Lindley's paradox.

5. Concluding remarks. A very general means of testing the fit of a 
parametric function has been proposed. The parametric model is rejected if 
its posterior probability is too small. The test can be carried out in either a 
Bayesian or frequentist way. Alternatives to the null hypothesis are modeled 
by a sequence of models, which need not be nested. Our simulation study 
supports the conclusion that test validity is generally well maintained by 
use of an asymptotic distribution. It also shows that our proposed tests can 
compare favorably with other omnibus lack of fit tests. 

Although some of the theory assumes the model is of the generalized linear 
form, the test can be used in a general likelihood context, including discrete 
or continuous data and multivariate data with dependence among observa- 
tions. Our example using variable star data illustrates the fact that our test 
can accommodate dependent data. The applicability and the performance of 
the method in a variety of complex settings, including longitudinal and other 
types of clustered data, is a topic of current research. In the multiple
regression case, where the covariates belong to a subset of $\mathbb{R}^d$
($d > 1$), a variety of
sequences of alternative models, including singleton models and variations 
thereof, can be chosen and it is not clear which sequence is preferable or 
leads to optimal power characteristics. An extensive simulation study in a 
variety of settings can shed more light on these important practical issues. 
Furthermore, a score version of the proposed frequentist test can be consid- 
ered, as well as robust versions of it. These variations and extensions are 
currently under investigation. 



APPENDIX

Following are assumptions needed in our proofs of Theorems 3 and 4:

A1. The design points $x_1, \ldots, x_n$ are fixed and confined to a compact
subset $S$ of $\mathbb{R}^d$ for all $n$.

A2. The functions $\gamma_1, \ldots, \gamma_p, u_1, u_2, \ldots$ satisfy the
following assumptions:

(i) There exists $B_1 < \infty$ such that

$$\sup_{x \in S} |\gamma_j(x)| \le B_1, \qquad j = 1, \ldots, p,$$

and

(ii) there exists a sequence of positive constants $\{B_j : j = 1, 2, \ldots\}$
such that

$$\sup_{1 \le j \le K,\ x \in S} |u_j(x)| \le B_K, \qquad K = 1, 2, \ldots.$$

A3. The functions $v_1, v_2, \ldots$ satisfy (4) and (5) and
$\hat v_1, \hat v_2, \ldots$ are constructed from
$\gamma_1, \ldots, \gamma_p, u_1, u_2, \ldots$ as described at the beginning
of Section 3.

A4. Let $A_{n,K}$ denote the $n \times K$ matrix with $i,j$ element
$u_j(x_i)$. Then we assume that the diagonal elements of
$A_{n,K}^T A_{n,K}/n$ are all 1, and that the smallest eigenvalue of
$A_{n,K}^T A_{n,K}/n$ is bounded away from 0 for all $n$ and $K$.

A5. The dispersion parameter $a(\eta_0)$ is positive, and the MLEs
$\hat\eta_0$ and $\hat\theta_0$ of $\eta_0$ and $\theta_0$, respectively, are
such that $E(a(\hat\eta_0) - a(\eta_0))^2$ and
$E\|\hat\theta_0 - \theta_0\|^2$ exist and are each $O(n^{-1})$.

A6. Let $\Theta$ be the parameter space for $\theta$. There exists a compact,
connected subset $\mathcal{N}$ of $\Theta$ such that
$\theta_0 \in \mathcal{N}$ and, for each $x \in S$, $g(x; \theta)$ is a
continuous function of $\theta$ on $\mathcal{N}$.

A7. The function $b$ is thrice differentiable with

$$\sup_{x \in S,\ \theta \in \mathcal{N}} |b'''(g(x; \theta))| \le B_2$$

for some constant $B_2$, the function (of $x$) $b''(g(x; \theta))$ is
nonnegative for each $\theta \in \Theta$, and

$$\inf_{x \in S,\ \theta \in \mathcal{N}} b''(g(x; \theta)) > 0.$$

A8. The number of singleton models, K, tends to infinity with n in such 
a way that K < ni/8-« and BkK'^I'^ < n^l"^ where a is any number 
such that $0 < a < 1/8$.

Proof of Theorem 1. Using the explicit expression of the BIC numbers
for the models under consideration, we can write the test statistic in
the following form, with $m_j = |M_j|$:

$$n^{(1/2)(m - m_0)}(1 - \pi_{\mathrm{BIC}}) = n^{(1/2)(m - m_0)}\,\frac{\sum_{j=1}^{K} n^{-(1/2)(m_j - m_0)} \exp(\mathcal{L}_j/2)}{1 + \sum_{j=1}^{K} n^{-(1/2)(m_j - m_0)} \exp(\mathcal{L}_j/2)}.$$

By definition of $m$ and assumption (i), the denominator of the fraction is
$1 + O_p(n^{-(1/2)(m - m_0)})$, while the numerator is equal to
$n^{-(1/2)(m - m_0)} \sum_{j=1}^{\tilde m} \exp(\mathcal{L}_{m(j)}/2) + o_p(n^{-(1/2)(m - m_0)})$.
The result now follows from assumption (ii). □

Proof of Theorem 3. Throughout the proof $C_1, C_2, \ldots$ denote
positive constants that depend on neither $n$ nor $K$. To simplify notation, we
have suppressed the dependence of the $v_j$s and $\hat v_j$s on $n$.

We may express $\hat a_j$ as

$$\hat a_j = \tilde a_j + e_{1j} + e_{2j},$$

where, for $j = 1, 2, \ldots$,

$$\tilde a_j = \frac{1}{n}\sum_{i=1}^{n} [Y_i - b'(g(x_i; \theta_0))]v_j(x_i),$$

$$e_{1j} = \frac{1}{n}\sum_{i=1}^{n} [Y_i - b'(g(x_i; \theta_0))][\hat v_j(x_i) - v_j(x_i)]$$

and

$$e_{2j} = \frac{1}{n}\sum_{i=1}^{n} [b'(g(x_i; \theta_0)) - b'(g(x_i; \hat\theta_0))]\hat v_j(x_i).$$

We may write

$$\tilde S_n = \sum_{j=1}^{K} \exp(U_{jn}/2) + \sum_{j=1}^{K} \exp(U_{jn}/2)[\exp(R_{jn}) - 1] = \sum_{j=1}^{K} \exp(U_{jn}/2) + r_{Kn},$$

where $U_{jn} = n\tilde a_j^2/a(\eta_0)$ and $R_{jn} = (V_{jn} - U_{jn})/2$,
with $V_{jn} = n\hat a_j^2/a(\hat\eta_0)$. Obviously

$$|r_{Kn}| \le \max_{1 \le j \le K} |\exp(R_{jn}) - 1| \sum_{j=1}^{K} \exp(U_{jn}/2).$$

The remainder of the proof consists of two main parts:

(a) Showing that $\delta_n = \max_{1 \le j \le K} |\exp(R_{jn}) - 1|$ is
asymptotically negligible, and

(b) obtaining the large sample distribution of
$\sum_{j=1}^{K} \exp(U_{jn}/2)$.

We first consider (a). By Taylor's theorem,

$$\exp(R_{jn}) = 1 + R_{jn}\exp(\tilde R_{jn})$$

for $\tilde R_{jn}$ such that $|\tilde R_{jn}| \le |R_{jn}|$, and so

(A.1)  $$\delta_n \le \max_{1 \le j \le K} |R_{jn}|\exp(|R_{jn}|).$$
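Define

$$T_{1j} = \frac{n\tilde a_j^2}{2}\left(\frac{1}{a(\hat\eta_0)} - \frac{1}{a(\eta_0)}\right), \qquad T_{2j} = \frac{n\tilde a_j(e_{1j} + e_{2j})}{a(\hat\eta_0)}$$

and

$$T_{3j} = \frac{n(e_{1j} + e_{2j})^2}{2a(\hat\eta_0)}.$$

By (A.1) we have

$$P\left(\max_{1 \le j \le K} |R_{jn}|\exp(|R_{jn}|) > \varepsilon\right) \le \sum_{j=1}^{K}\sum_{\ell=1}^{3} P(|T_{\ell j}| > \varepsilon'/3).$$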



Clearly, 



P \T„\>'-)<P ny'a]>2 



1 



1 



«(%) a{vo) 



> 



By Markov's inequality. 



^ - 2v/i73 



2Ve' 



where we have used E{Yi) = b' {gixfjOo)), Var(yj) = a{riQ)b" {g{xi;6Q)) and 
A3. A bit of algebra shows that, for all n sufficiently large. 



1 



1 



aim) a{rjo) 



> \ - ]<P{\a{rjo)-a{vo)\> 



o(??o)\/e' + \/3nV3 



By Markov's inequality and A5, the last probability is 0{n ^/^). We have 
thus shown that Ef=iP{\Tij\ > e'/S) = 0{Kn-^/^). 
We turn now to the terms T2j, for which 

+ P{n\aj\\e2j\ > 



^'l 1^2,1 >-) <P(7^|a,||ei,|> 



+ P a(f/o) < 



12 

«(%) 



— ; 
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The third summand in the last term is 0(n^^), independent of j. Letting 
Ci=e'airjo)/12, 



P{n\aj\\e2j\>Ci)<P{n^/^\aj\ > ^/C^) + P{n^^^\e2j\ > VC^) 



We have 



1 " ^ 



^2J = --II 



4 = 1 



(5(x,;0o)-9(xi;0o))6"(5(xi;0o)) 

+ Oo) - g{xi; Oo)fb"'{gi) Vj{xi), 

where gi is between g{xi;6Q) and gixfjOo). The orthogonahty properties (4) 
imply that the last expression is simply —{2n)~^ J2i'=i{9i^i'i ^o) — g{^u ^o))'^ x 
b"'{gi)vj{xi), and so 

\e2j \ < -J2i^"'i9i) fvj{^i) max(c/(xj;0o) -9(xi;0o))V2 

X"^ J l<t<n 

2 W"im)\ 



^ C2\\0q — 6q\\ max 



i<i<n ^b"{g{^f,eo)) 
It follows that 

P{n^/'\e2j\ > VCI) < P{n^/'\\Oo - Oof > C3) +pf max \b"'{gi)\ > cX 

\l<i<n J 

where C3 and C4 are defined so that C4 exceeds the value in A7. We 
now have 

P{n^l\2o\ > ycT) < + max > Qn^o G +P(0o G AA^). 

Wl-^ \\<i<n J 

On the event 0o G AA, assumption A6 implies that gi = g{xi;Oni) for Oni G AA, 
and, hence, (by A7) P(maxi<j<„ \b"'{gi)\ >C4n9o£M) = 0. Along with A5, 
we thus have 



Now consider 

P{n\aj\\eij\ > Ci) < PK|dj| > \fc'i) + P(n^-'=|eij| > ^^Cl 
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for c a number in (0, 1/2). We may write 
p+j ^ 2 " 

and so 



1 ^ ■ 1 



P+j P+j / 2 " 



r=l r=l \ 1=1 



def^'^-'' - 2 
r=l 



Before proceeding, we define some matrix notation. Let A denote the 
matrix A„,^x in A4, and W and W the nx n diagonal matrices with respec- 
tive diagonal elements b" {g{xi;9o)) and b" {g{xi;Oo)), i = 1, . . . ,n. Matrices 
B and B are the R matrices in the QR decompositions of W^/^A/-^/n and 
"W^/^A/^/n, respectively. We then have 

p+j 

J20rj-f3rjf < ip + j)max0rj-f3rj? < {p + j)\\B~^ - \\l 

= (p + j)||B-i(B-B)B-i||2 
^ (p + i)||B-B||l 

" C72(B)C72(B) ' 

where $\sigma(M)$ denotes the smallest singular value of matrix $M$. A result of
Drmac, Omladic and Veselic (1994) implies that

IIB < (RK^ ^ .P2K^\ IIA^WA/nlbllA^WA-A^WAIIi 
||B - BL < (8K + V2if ) ^2(ATWA) ■ 

Assumptions A2, A4 and A7 and basic properties of matrix norms [Golub 
and Van Loan (1996)] now imply that 

||B - B||2 < Cj{m'' + V2K^) max {b"{g{^i; Oq) - h" {g{i^i-eo))f . 

l<i<n 

Combining previous results yields 

P(n2(i-^)e?j- > Ci) < P^n'^ip + j){SK^ + V2K^) 

X max (6"(5(x,; 0o)) - 6"(<7(xi; eo))f/a\B) > C, 

l<i<n 

+ P(n2(i-^)-^Z„,> V^) 
<(p + i)[C9i^V-^ + Cioni-2--] 
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and 



Taking 1 - 2c = (1 - a)/5 = [a - (1 - 2c)]/2 = 1/8 and demanding that K = 
o((n}^^) ensures that the right-hand side above tends to 0. 

Since $n(e_{1j} + e_{2j})^2 \le 2n(e_{1j}^2 + e_{2j}^2)$, the term
$\sum_{j=1}^{K} P(|T_{3j}| > \varepsilon'/3)$ can be bounded by a quantity that
is asymptotically negligible in comparison to
$\sum_{j=1}^{K} P(|T_{2j}| > \varepsilon'/3)$. Combining all the previous
steps, it now follows that $\delta_n$ tends to 0 in probability as
$n \to \infty$.

We turn now to step (b) in our proof. We may write

$$\frac{\tilde S_n - b_K}{a_K} = S_{Kn} + \frac{r_{Kn}}{a_K},$$

where

$$S_{Kn} = \frac{\sum_{j=1}^{K} \exp(U_{jn}/2) - b_K}{a_K}.$$

We will first show that $S_{Kn}$ converges in distribution to a stable law, and
then that $r_{Kn}/a_K$ converges in probability to 0. Now let $F_{Kn}$ be the
c.d.f. of $S_{Kn}$, and $F_K$ the c.d.f. of a random variable having exactly
the same form as $S_{Kn}$ but with $U_{1n}, \ldots, U_{Kn}$ replaced by
$Z_1^2, \ldots, Z_K^2$, where $Z_1, Z_2, \ldots$ are i.i.d. random variables
having the standard normal distribution. Obviously,

$$F_{Kn}(x) = F_K(x) + (F_{Kn}(x) - F_K(x)).$$

Theorem 1.8, pages 50 and 51 of Samorodnitsky and Taqqu (1994), implies
that $F_K$ converges uniformly to $F$, where $F$ is the $S_1(1, 1, 0)$ stable
law, in the notation of Samorodnitsky and Taqqu (1994).
Now $F_K$ can be written as

$$F_K(x) = P\left(\sum_{j=1}^{K} \exp(Z_j^2/2) \le a_K x + b_K\right) = P((Z_1, \ldots, Z_K) \in A_{x,K}).$$

Likewise,

$$F_{Kn}(x) = P(\sqrt{n}(\tilde a_1, \ldots, \tilde a_K)/\sqrt{a(\eta_0)} \in A_{x,K}).$$

Due to the convexity of the exponential function, the sets $A_{x,K}$ are convex
for all $x$ and $K$, and, hence, we may apply the multivariate Berry-Esseen
theorem of Götze (1991) to obtain the bound
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for all K >6, where 



(na(7?o))3/2 ^1 



K 



1=1 



3/2 



The uniform boundedness of b"'{g{- ;0q)) (A7) now implies that 



Ci6 " 



n 



K 



j=lLi=l 



3/2 



< 



1=1 j=i 



n 



1/2 



Finally, then 



n 



1/2 



and the right-hand side of the last expression tends to 0 by A8.
Finally, consider

$$\frac{|r_{Kn}|}{a_K} \le \delta_n\,\frac{\sum_{j=1}^{K} \exp(U_{jn}/2) - b_K + b_K}{a_K} = \delta_n[O_p(1) + b_K/a_K] = o_p(1) + \delta_n b_K/a_K.$$

It is straightforward to show that $|b_K/a_K| \le C_{19}\log K$. Examining our
proof that $\delta_n$ converges in probability to 0 makes it clear that
$\delta_n \log K$ does also, and, hence, the proof is complete. □



Proof of Theorem 4. For all $K > m$, we have

$$\frac{\tilde S_n - b_K}{a_K} = W_n + T_n,$$

where

$$W_n = \frac{1}{a_K}\sum_{j=1}^{m} \exp(V_{jn}/2) \quad\text{and}\quad T_n = \frac{\sum_{j=m+1}^{K} \exp(V_{jn}/2) - b_K}{a_K}.$$
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Obviously 

P{Wn + Tn> So) = P{Tn > Sa) + P{Wn + T„ > H T„ < 

We first consider the case where = Without loss of generality, sup- 
pose the largest \4>- j\ is 1 0m I) and consider, for any £■ > 0, 

m— 1 \ /m— 1 , 

— J2 >ej<Pi^\J |exp(F,„/2) > 

1 ^ ^ 

<Y^p(eMv,n/2)>:^j. 

Now Vjn = naj / a{fiQ) , where 
1 " 

aj = -Y,[Yi-b'{g{^i;eo))]v,i^i) 

1 " 

1=1 

and the last statement follows by arguing as in the proof of Theorem 3. 
Continuing from the last expression, we have 

1 " 

i=l 
1 " 



1=1 



1 " 



1=1 



( 71 +72V21ogajr ^ , r)^ -ii ^ , i -l/2^ 
+ 1= 0j+O(n \ogaK)+Op{n ') 



We may thus write 

(A.2) = Z,^ + (71 + 72^/2bi^)^^ + Op(l), 

where Zj^ converges in distribution to a standard normal random variable, 
and we have used the fact that ?7o is consistent for r/o under our local alterna- 
tives. Using (A.2) and the fact that ^2\4>j\/ \/ < 1 for j = 1, . . . , m — 1, 
it is easy to verify that 
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as n — > oo for each j = 1, . . . , m — 1. Combined with previous results, this 
imphes that J2]l=i^ exp(Vj>i/2)/ai<' converges to in probabihty when = 
C, and, hence, the power has the same hmit as 

P(Tn > Sa) + P{eX];){Vmn/'^)/aK +Tn> SaHTn < Sq,). 

Define T„ by 

axisa - Tn) = max(l,ax(sQ, - r„)). 

Using (A. 2) and some straightforward algebra, and assuming without loss 
of generality that (p^n > 0, we have 

Zmn + (71/72 + V2 log QK ) + Op(l) 

log a/^ + 2 log(sQ — T„) =^ exp(14n„/2)/a/^ + T„ > Sq. 
By Taylor's expansion, 

^/2\ogaK + 21og(g^ - r„) = V21ogaj^ + ^°g(^"l^") ^ 

where Un is between 21ogaA- and 2\ogaK + 21og(sQ, — T„). 
For any < e < 1, define = I{-oa^s^-e){Tn)- Then 

P[ Zr^^n + — + Op 1 > n In,e = 1 

V 72 V21ogaK + 21oge 

< P\ Zmn H V Op(l) > 7= n = 

V 72 Vf4 

< P ^exp (^-^^^ /oi^ + In > Sa n = 1^ . 

Now, arguing as in the proof of Theorem 3, {Zmn,Tn) converges in distribu- 
tion to {Z,T), where Z and T are independent with standard normal and 
1,0) distributions, respectively. Since aj^- — > oo, this implies that 



I log(5a - Tn)\In,e 

y/2 log ax + 2 log e 

and so 



^ 72 V2 log ai^ + 2 log e 

,p(^Z> -— nr<sa-eV 



72 

Combining previous results, and by the arbitrariness of e, we now have 
(A.3) liminf P(exp(y^„/2)/ax + T„ > s«) > a + (1 - a)$(7i/72). 
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Now, for any e > 0, 

P(exp(Kin/2)/ai^ + Tn > n r„, < Sa) 

< P{exp{Vmn/2)/aK > e n T„ < s«) + -e<Tn< Sa). 
Arguing as we did before, the very last quantity converges to 

(1 - a)$(7i/72) + P{sa -e<T< 
and, hence, by the arbitrariness of e, 

hmsupP(exp(y^„/2)/a/^ +r„ >Sa)<a + {l- a)«5(7i/72)- 

n— >oo 

Combined with (A. 3), this yields Theorem 4 for the case — C- 

When $\gamma_2^{-1} > \zeta$, we may show that $W_n$ tends to 0 in probability
using the same argument that was applied to $W_n - \exp(V_{mn}/2)/a_K$ in the
case $\gamma_2^{-1} = \zeta$.

For $\gamma_2^{-1} < \zeta$, the limiting power is at least

$$P(T_n > s_\alpha) + P(\exp(V_{mn}/2)/a_K + T_n > s_\alpha \cap T_n \le s_\alpha),$$

and if we follow exactly the same steps used in the case
$\gamma_2^{-1} = \zeta$, we may establish that the limiting power is 1. □

Proof of Theorem 5. Let $Z_1, Z_2, \ldots$ be a sequence of i.i.d. standard
normal random variables, and define
$d_K = 2\log K - \log\log K - \log\pi$. We first assume that $H_0$ is true.
Since

$$\max_{1 \le j \le K} U_{jn} - 2\max_{1 \le j \le K} |R_{jn}| \le R_n \le \max_{1 \le j \le K} U_{jn} + 2\max_{1 \le j \le K} |R_{jn}|$$

and we have already shown that
$\delta_n = \max_{1 \le j \le K} |\exp(R_{jn}) - 1|$ is asymptotically
negligible, it suffices to study the distribution of
$\tilde R_n = \max_{1 \le j \le K} U_{jn}$. Let $G_{nK}$ and $G_K$ denote the
distribution functions of $\tilde R_n - d_K$ and
$\max_{1 \le j \le K} Z_j^2 - d_K$, respectively. The quantity
$\sup_x |G_{nK}(x) - G_K(x)|$ is bounded by exactly the same quantity as was
$\sup_x |F_{nK}(x) - F_K(x)|$ in the proof of Theorem 3, and, hence, we need
only consider

$$P\left(\max_{1 \le j \le K} Z_j^2 \le x + d_K\right) = [1 - 2(1 - \Phi(\sqrt{x + d_K}))]^K = \left[1 - \frac{2\phi(\sqrt{x + d_K})}{\sqrt{x + d_K}} + o(K^{-1})\right]^K = \exp(-e^{-x/2}) + o(1),$$

where $\phi$ denotes the standard normal density, which completes the proof in
the null case.



K 
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Now assume that the local alternatives of Theorem 4 hold, and define 
WiK = max Ujn and W2K = max 

l<j<m. m<j<K 

As in the proof of Theorem 4, we assume without loss of generality that the
largest value of $|\phi_j|$ is at $j = m$. Three facts are key in the rest of
the proof:

(i) $R_n$ has the same limiting distribution as $\tilde R_n$.

(ii) $P(W_{1K} > d_K + x_\alpha)$ converges to 0 as $n \to \infty$.

(iii) $W_{2K} - d_K$ has a limiting distribution equal to that in the null
case.

Proof of (i)-(iii) is not provided here since it closely parallels arguments in
the proof of Theorem 4.

Facts (i) and (ii) imply that

$$P(R_n - d_K > x_\alpha) = P(U_{mn} > d_K + x_\alpha \cup W_{2K} - d_K > x_\alpha) + o(1).$$

As in the proof of Theorem 4, it is easy to check that

$$\lim_{n \to \infty} P(U_{mn} > d_K + x_\alpha) = \Phi(\gamma_1\zeta).$$

This along with (iii) and the fact that $U_{mn}$ and $W_{2K}$ are
asymptotically independent implies that

$$\lim_{n \to \infty} P(U_{mn} > d_K + x_\alpha \cup W_{2K} - d_K > x_\alpha) = \Phi(\gamma_1\zeta) + \alpha - \Phi(\gamma_1\zeta)\alpha = \alpha + (1 - \alpha)\Phi(\gamma_1\zeta),$$

which completes the proof. □